Product Demand Forecasting – Forecasting for Thousands of Products Across Multiple Stores

forecasting, multilevel-analysis, predictive-models, time-series

I'm currently working on a demand forecasting task, with data on tens of thousands of products across a couple thousand stores. More specifically, I have a few years' worth of daily sales data per product in each store, and my goal is to forecast the future sales of each item in each store, one day ahead, then two days ahead, and so on.

So far I've considered treating each product-store pair as a single time series and producing a forecast for each series, as was done in Neal Wagner's paper, Intelligent techniques for forecasting multiple time series in real-world systems. In other words, I would use only the historical sales of a particular product in a particular store to forecast that product's future sales in that store.
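
To make that concrete, here is a minimal sketch of the per-series approach: one simple model fitted independently to each product-store pair, using only that pair's own history. The column names (`store_id`, `item_id`, `date`, `sales`), the weekly seasonality, and the use of Holt-Winters smoothing are all illustrative assumptions on my part, not the technique from Wagner's paper.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def forecast_per_series(sales: pd.DataFrame, horizon: int = 7) -> pd.DataFrame:
    forecasts = []
    for (store, item), grp in sales.groupby(["store_id", "item_id"]):
        # Use only this product-store pair's own history.
        y = (grp.sort_values("date")
                .set_index("date")["sales"]
                .asfreq("D")
                .fillna(0.0))
        # Weekly seasonality is a guess; a real system would select the model per series.
        model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=7)
        fc = model.fit().forecast(horizon)
        forecasts.append(pd.DataFrame({
            "store_id": store, "item_id": item,
            "date": fc.index, "forecast": fc.values,
        }))
    return pd.concat(forecasts, ignore_index=True)
```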

However, I've been browsing Kaggle, and competitions like the Corporación Favorita Grocery Sales Forecasting suggest a different approach, which is to use the information from all stores and all products to predict future sales. As I understand it, the historical sales of all products in all stores are dumped into a single training set, from which the model learns to forecast future sales. It's very different from traditional time series methods, but apparently, based on the results of the competition, it works.
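
A rough sketch of that pooled ("global model") idea, under assumptions about column names and with LightGBM standing in for whatever the competitors actually used: every product-store series contributes rows to one training table, with its own lagged sales as features plus the store and item identifiers, and a single model is trained on all of it.

```python
import pandas as pd
import lightgbm as lgb

def make_training_table(sales: pd.DataFrame, lags=(1, 7, 14, 28)) -> pd.DataFrame:
    """sales has columns [store_id, item_id, date, sales]; date is a datetime column."""
    df = sales.sort_values(["store_id", "item_id", "date"]).copy()
    grp = df.groupby(["store_id", "item_id"])["sales"]
    for lag in lags:
        df[f"lag_{lag}"] = grp.shift(lag)        # each row only sees its own series' history
    df["dow"] = df["date"].dt.dayofweek          # simple calendar feature
    for col in ["store_id", "item_id"]:
        df[col] = df[col].astype("category")     # identifiers become categorical features
    return df.dropna()

train = make_training_table(sales)               # `sales` is assumed to hold all stores and items
features = [c for c in train.columns if c.startswith("lag_")] + ["dow", "store_id", "item_id"]
model = lgb.LGBMRegressor(n_estimators=500)      # one global model across every series
model.fit(train[features], train["sales"])
```

This only covers the feature/training side; forecasting two or more days ahead would mean either applying the model recursively or training one model per horizon.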

The latter method seems promising and more robust. However, there's the problem of having to process hundreds of millions of data points.

Which method is more appropriate for my task? For those who have worked on similar problems, which methodology would you recommend?

Best Answer

I wouldn't recommend the approach used by Neal et al. Their data is unique for two reasons:

  • They are working with food data, which is usually denser and more stable than other retail product sales data. A given location will be selling dozens of milk cartons or egg packs per week and will have been selling those same products for decades, compared to fashion or car parts, where it is not unusual to see a single unit of an item sold every 3 or 4 weeks, with data available for only a year or two.

  • They are forecasting for warehouses, not stores. A single warehouse covers multiple stores, so their data is even denser than average. In fact, a warehouse is typically used as a natural aggregation/grouping level for stores, so they are already essentially performing a grouping of store data.

Because of the nature of their data, they can get away with modeling individual time series directly. But most retailers' data would be too sparse at the individual sku/store level for them to pull that off.

As zbicyclist said, this problem is usually approached using hierarchical or multi-echelon forecasting. Commercial demand forecasting packages all use some form of hierarchical forecasting.

The idea is to group products and stores into similar product groups and regions, generate aggregate forecasts at those levels to determine overall seasonality and trend, and then spread those down (reconciling with a Top-Down approach) onto the baseline forecasts generated for each individual sku/store combination.
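
As a minimal illustration of the spreading step, here is one simple proportional variant of Top-Down, not a full implementation of what commercial packages do: the aggregate forecast carries the group's seasonality and trend, and each sku/store receives a share of it based on its recent volume within the group. The column names, the `group_id` key, and the 90-day window are assumptions.

```python
import pandas as pd

def spread_top_down(sales: pd.DataFrame, group_forecasts: pd.DataFrame) -> pd.DataFrame:
    """sales: [group_id, store_id, item_id, date, sales];
    group_forecasts: [group_id, date, forecast] from any aggregate-level model."""
    recent = sales[sales["date"] >= sales["date"].max() - pd.Timedelta(days=90)]
    # Each sku/store's share of its group's recent volume acts as the baseline weight.
    base = recent.groupby(["group_id", "store_id", "item_id"], as_index=False)["sales"].sum()
    base["share"] = base["sales"] / base.groupby("group_id")["sales"].transform("sum")
    # The group forecast carries the seasonality/trend; spread it down by share.
    out = group_forecasts.merge(base[["group_id", "store_id", "item_id", "share"]], on="group_id")
    out["forecast"] = out["forecast"] * out["share"]
    return out[["group_id", "store_id", "item_id", "date", "forecast"]]
```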

Besides the challenge zbicyclist mentioned, a bigger problem is that finding the optimal groupings of products and stores is a non-trivial task, which requires a combination of domain expertise and empirical analysis. Products and stores are usually grouped together in elaborate hierarchies (by department, supplier, brand, etc. for products; by region, climate, warehouse, etc. for locations), which are then fed to the forecasting algorithm along with the historical sales data itself.
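
One illustrative way to build the grouping key used in the sketch above is to join product- and store-side hierarchy tables and pick one level from each side. The tables `product_hierarchy` and `store_hierarchy`, and the choice of department/region as the grouping level, are assumptions; as noted, the right level has to be found through domain expertise and experimentation.

```python
import pandas as pd

def add_group_key(sales: pd.DataFrame, product_hierarchy: pd.DataFrame,
                  store_hierarchy: pd.DataFrame) -> pd.DataFrame:
    # Attach one product-side and one store-side hierarchy level to every row.
    out = (sales.merge(product_hierarchy[["item_id", "department"]], on="item_id")
                .merge(store_hierarchy[["store_id", "region"]], on="store_id"))
    # Combine them into a single aggregation key for the group-level forecasts.
    out["group_id"] = out["department"].astype(str) + "/" + out["region"].astype(str)
    return out
```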


Addressing meraxes' comments

How about the methods used in the Corporación Favorita Grocery Sales Forecasting Kaggle Competition, where they allow the models to learn from the sales histories of several (possibly unrelated) products, without doing any explicit grouping? Is this still a valid approach?

They're doing the grouping implicitly by using store, item, family, class, and cluster as categorical features.
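
Continuing the pooled-model sketch from the question, the "implicit grouping" is just a matter of joining product/store metadata onto the training table and declaring those columns categorical, so the tree model can split on (i.e. group by) them. The metadata tables `items` (family, class) and `stores` (cluster) below are assumptions modeled on the Favorita data layout, not code from any competition solution.

```python
import pandas as pd

def add_grouping_features(train: pd.DataFrame, items: pd.DataFrame,
                          stores: pd.DataFrame) -> pd.DataFrame:
    # Join product- and store-level metadata onto the pooled training table.
    out = (train.merge(items[["item_id", "family", "class"]], on="item_id")
                .merge(stores[["store_id", "cluster"]], on="store_id"))
    # Category dtype lets LightGBM treat these columns as categorical splits.
    for col in ["store_id", "item_id", "family", "class", "cluster"]:
        out[col] = out[col].astype("category")
    return out
```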

I've just read through a bit of Rob Hyndman's section on hierarchical forecasting. It seems to me that a Top-Down approach provides reliable forecasts for aggregate levels; however, it has the huge disadvantage of losing information due to aggregation, which may affect forecasts for the bottom-level nodes. It may also be "unable to capture and take advantage of individual series characteristics such as time dynamics, special events".

Three points regarding this:

  • The disadvantage he points to depends on the grouping of the data. If you aggregate all the products and stores, then yes, this would be a problem. For example, aggregating the stores from all regions together would wash out any region-specific seasonality. But you should be aggregating up only to the relevant grouping level, and as I pointed out, finding it will require some analysis and experimentation.
  • In the specific case of retail demand, we are not worried about "losing information due to aggregation", because frequently the time series at the bottom nodes (i.e. SKU/Store) contain very little information, which is why we aggregate them up to the higher levels in the first place.
  • For SKU/store specific events, the way we approach it on my team is to remove the event-specific effects prior to generating a forecast, and then add them back after the forecast is generated (a rough sketch follows below). See here for details.
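
A rough sketch of that "remove events, forecast, add back" workflow, under a crude additive-uplift assumption: estimate the average uplift on known event days from history, subtract it before fitting, forecast the cleaned series with any baseline forecaster, then re-apply the uplift on future event dates. The event calendar and the additive form are illustrative; the linked answer describes the actual approach.

```python
import pandas as pd

def forecast_with_events(y: pd.Series, event_dates: pd.DatetimeIndex,
                         future_events: pd.DatetimeIndex, fit_and_forecast) -> pd.Series:
    """y: daily sales with a DatetimeIndex; fit_and_forecast: any callable
    that takes a cleaned series and returns a forecast indexed by future dates."""
    is_event = y.index.isin(event_dates)
    uplift = y[is_event].mean() - y[~is_event].mean()      # crude additive event effect
    cleaned = y.copy()
    cleaned[is_event] -= uplift                             # strip the effect before modeling
    forecast = fit_and_forecast(cleaned)                    # baseline forecast on cleaned data
    forecast[forecast.index.isin(future_events)] += uplift  # add the effect back afterwards
    return forecast
```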