If you do a good enough job modeling the important predictor variables, you probably will not need to worry as much about the time series aspects (you should still test for serial correlation and adjust for it if needed).
Most of the time-series-style association you will see can easily be modeled by things like day of the week, holiday/vacation indicators, and the time since the DVD release or since some form of advertising or event that spurs rentals of a particular movie.
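To make that concrete, here is a minimal sketch of what I mean: regress daily rentals on calendar-type predictors, then run a Ljung-Box test on the residuals to see whether any serial correlation is left to deal with. The file name, column names, and the holiday list are all placeholders for your own data, not a prescribed setup.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import acorr_ljungbox

# Hypothetical daily rental history for one title; column names are placeholders.
df = pd.read_csv("one_title_daily.csv", parse_dates=["date", "release_date"])

holidays = pd.to_datetime(["2016-12-25", "2017-01-01"])   # example holiday list
df["dow"] = df["date"].dt.day_name()                      # day-of-week effect
df["holiday"] = df["date"].isin(holidays).astype(int)     # holiday/vacation indicator
df["days_since_release"] = (df["date"] - df["release_date"]).dt.days

fit = smf.ols("units ~ C(dow) + holiday + days_since_release", data=df).fit()

# If the calendar features capture the structure, the residuals should look like
# noise; small p-values here mean there is serial correlation you still need to
# adjust for.
print(acorr_ljungbox(fit.resid, lags=[7, 14], return_df=True))
```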
I wouldn't recommend the approach used by Neal et al. Their data is unique for two reasons:
They are working with food data, which is usually denser and more stable than other retail product sales data. A given location will be selling dozens of milk cartons or egg packs per week and will have been selling those same products for decades, compared to fashion or car parts, where it is not unusual to see sales of a single item only every 3 or 4 weeks, with data available for only a year or two.
They are forecasting for warehouses, not stores. A single warehouse covers multiple stores, so their data is even denser than average. In fact, a warehouse is typically used as a natural aggregation/grouping level for stores, so they are already essentially performing a grouping of store data.
Because of the nature of their data, they can get away with modeling individual time series directly. But most retailers' data would be too sparse at the individual SKU/store level for that to work.
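A quick way to gauge which situation you are in is to measure, per SKU/store series, how many of the observed weeks actually have sales. This is only a rough diagnostic sketch; the file, columns, and the 0.5 cutoff are illustrative assumptions.

```python
import pandas as pd

# Hypothetical weekly sales table with columns: sku, store, week, units.
sales = pd.read_csv("weekly_sku_store_sales.csv")

per_series = sales.groupby(["sku", "store"]).agg(
    weeks_with_sales=("units", lambda s: (s > 0).sum()),
    weeks_observed=("week", "nunique"),
)
per_series["density"] = per_series["weeks_with_sales"] / per_series["weeks_observed"]

# If most series sell in fewer than, say, half of the observed weeks, direct
# per-series time series models will struggle and aggregation is the safer bet.
print((per_series["density"] < 0.5).mean())
```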
As zbicyclist said, this problem is usually approached using hierarchical or multi-echelon forecasting; commercial demand forecasting packages all use some form of hierarchical forecasting.
The idea is to group products and stores into similar product groups and regions, for which aggregate forecasts are generated and used to determine overall seasonality and trend. These are then spread back down and reconciled, using a top-down approach, with the baseline forecasts generated for each individual SKU/store combination.
Besides the challenge zbicyclist mentioned, a bigger problem is that finding the optimal groupings of products and stores is a non-trivial task that requires a combination of domain expertise and empirical analysis. Products and stores are usually grouped into elaborate hierarchies (by department, supplier, brand, etc. for products; by region, climate, warehouse, etc. for stores), which are then fed to the forecasting algorithm along with the historical sales data itself.
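For illustration, here is a stripped-down sketch of the top-down step: aggregate the sparse SKU/store series to a group level, forecast the aggregate, then spread the forecast back down by historical proportions. The column names and grouping are placeholders, and the simple exponential smoother stands in for whatever trend/seasonal model you would fit at the aggregate level; this is not any particular package's reconciliation method.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Hypothetical long-format table: group, sku, store, week, units.
sales = pd.read_csv("weekly_sales.csv")

forecasts = []
for group, g in sales.groupby("group"):
    # 1. Aggregate the sparse SKU/store series up to the group level.
    agg = g.groupby("week")["units"].sum().sort_index()

    # 2. Forecast the (much denser) aggregate series over a 4-week horizon.
    group_fc = SimpleExpSmoothing(agg).fit().forecast(4).sum()

    # 3. Spread the aggregate forecast down by each SKU/store's historical share.
    share = g.groupby(["sku", "store"])["units"].sum() / g["units"].sum()
    fc = (share * group_fc).rename("forecast").reset_index()
    fc["group"] = group
    forecasts.append(fc)

topdown = pd.concat(forecasts, ignore_index=True)
```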
Addressing meraxes' comments
How about the methods used in the Corporación Favorita Grocery Sales Forecasting Kaggle Competition, where they allow the models to learn from the sales histories of several (possibly unrelated) products, without doing any explicit grouping? Is this still a valid approach?
They're doing the grouping implicitly by using store, item, family, class, and cluster as categorical features.
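Roughly, that implicit grouping looks like the sketch below: a gradient-boosted model given the categorical columns directly, so it learns shared structure across products and stores on its own. This assumes a LightGBM-style model and a train table already joined with the items and stores metadata; the file name and hyperparameters are placeholders, not the competition-winning configuration.

```python
import pandas as pd
import lightgbm as lgb

# Hypothetical frame: Favorita train data already joined with the items and
# stores metadata so that family/class/cluster are available on each row.
train = pd.read_csv("favorita_joined.csv", parse_dates=["date"])

cat_cols = ["store_nbr", "item_nbr", "family", "class", "cluster"]
for c in cat_cols:
    train[c] = train[c].astype("category")   # let LightGBM split on these directly

train["dow"] = train["date"].dt.dayofweek
X = train[cat_cols + ["dow", "onpromotion"]]
y = train["unit_sales"]

# Illustrative hyperparameters only.
model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y)
```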
I've just read through a bit of Rob Hyndman's section on hierarchical forecasting. It seems to me that a top-down approach provides reliable forecasts for aggregate levels; however, it has the huge disadvantage of loss of information due to aggregation, which may affect forecasts for the bottom-level nodes. It may also be "unable to capture and take advantage of individual series characteristics such as time dynamics, special events".
Three points regarding this:
- The disadvantage he points to depends on the grouping of the data. If you aggregate all the products and stores together, then yes, this would be a problem. For example, aggregating the stores from all regions would muddy out any region-specific seasonalities. But you should be aggregating up only to the relevant grouping, and as I pointed out, finding that grouping will require some analysis and experimentation.
- In the specific case of retail demand, we are not worried about "losing information due to aggregation" because frequently the time series at the bottom nodes (i.e. SKU/store) contain very little information, which is why we aggregate them up to higher levels in the first place.
- For SKU/store-specific events, the way we approach it on my team is to remove the event-specific effects prior to generating a forecast, and then add them back after the forecast is generated. See here for details.
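A deliberately simplified sketch of that remove-then-add-back idea (not my team's actual code): estimate a multiplicative promo lift from history, divide it out before fitting the baseline, and multiply it back into the forecast on future event days. The lift estimate, the seasonal smoother, and the column names are all placeholder choices.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical daily history for one SKU/store: date, units, promo (0/1).
df = pd.read_csv("sku_store_daily.csv", parse_dates=["date"])

# 1. Estimate the event (promo) lift as the ratio of promo to non-promo means.
lift = df.loc[df["promo"] == 1, "units"].mean() / df.loc[df["promo"] == 0, "units"].mean()

# 2. Remove the event effect to get a "clean" history for the baseline model.
df["baseline_units"] = df["units"] / df["promo"].map({1: lift, 0: 1.0})

# 3. Forecast the baseline, then re-apply the lift on future promo days.
model = ExponentialSmoothing(df.set_index("date")["baseline_units"],
                             seasonal="add", seasonal_periods=7).fit()
future_promo = pd.Series([0, 0, 1, 0, 0, 1, 0])        # hypothetical promo plan
forecast = model.forecast(7) * future_promo.map({1: lift, 0: 1.0}).values
```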
Best Answer
That you have daily aggregated sales information for only 1 or 2 months, even for thousands of products and their variations, limits the possible analyses. For instance, if your sales are strongly seasonal, e.g., as a function of the winter holidays or conversely the warm summer months, then you won't be able to integrate this potentially important information into the model.
Per your question, as I see it, you have two broad options: turn the process over to one of the many vendors out there providing automated retail solutions or, alternatively, do it yourself.
Regardless of which option you choose, you would be wise to do a whole lot of exploratory work on this dataset just so that you feel like you understand it. That way, if the solution used (whatever it is) returns nonsense, you will have a good sense of when that is occurring.
There are plenty of vendors of turnkey, automated retail solutions. Here are a few of the big names. Others can provide additional names:
IBM's Demandtec or omni-channel solutions http://www-01.ibm.com/software/info/demandtec/
McKinsey's Periscope for retailers http://www.periscope-solutions.com/
Planet Retail http://www1.planetretail.net/what-we-do
Khi Metrics http://www.groceryretailonline.com/doc/khi-metrics-0001
And in terms of DIY, given the massive volume of information and the relatively short time frame (~30 days), I doubt that traditional, univariate, "Box-Jenkins," ARIMA, VAR-type approaches lend themselves that readily to turnkey solutions. First, those approaches rely on many more than 30 data points just to initialize the lags and moving averages. Second, to the best of my knowledge, they aren't fully multivariate in the sense that a pooled or multilevel model is, though others may disagree. Regarding the suggestion made to use Hyndman's functional time series analysis, I can't evaluate the adequacy of that recommendation.
I think you need to find a functional form for the model that is flexibly appropriate for:
1) The relatively short span of historic information
2) The massively categorical nature of the products
3) The hardware and software challenges of processing huge volumes of information
4) The need to update and produce compiled parameters on some regularly scheduled basis so that automated predictions can be made
Presumably -- or better, hopefully -- you aren't doing this work on a single laptop or workstation but have access to some sort of multi-core platform such as AWS integrated with software such as Ufora. Ufora offers massively parallel analyses on AWS. There are workarounds to the limits, even in the cloud, of RAM or working memory for statistical modeling. These include the many variants of so-called "divide and conquer" algorithms which, essentially, amount to a greatly extended random forests approach. I've heard of shops where they will execute 3 or 4 million "mini-models" or random forest resampling iterations in a few hours on a 100 core Hadoop platform, then roll it up on the back end. A good reference for this is Chen and Xie's http://dimacs.rutgers.edu/TechnicalReports/TechReports/2012/2012-01.pdf
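To give a flavor of the divide-and-conquer idea, here is a toy sketch: fit one "mini-model" per data chunk in parallel and combine the estimates on the back end. A local process pool stands in for the cluster, the chunk model and its predictors are placeholders, and the plain averaging roll-up is a simplification of the weighted combination Chen and Xie describe.

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_chunk(chunk: pd.DataFrame) -> np.ndarray:
    # One small OLS "mini-model" per chunk; predictors are hypothetical.
    X = sm.add_constant(chunk[["price", "promo", "dow"]])
    return sm.OLS(chunk["units"], X).fit().params.values

if __name__ == "__main__":
    sales = pd.read_csv("all_sku_store_daily.csv")   # hypothetical large extract
    chunks = np.array_split(sales, 1000)             # 1000 mini-models
    with ProcessPoolExecutor() as pool:
        coefs = list(pool.map(fit_chunk, chunks))
    combined = np.mean(coefs, axis=0)                # simple averaging roll-up
```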
In terms of the functional form, I think you're looking at a variation of multilevel modeling...whether that be pooled OLS, HLMs, GAMs, or something else is a process that needs to be explored and cannot be determined in advance.
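To make the multilevel option concrete, here is a minimal mixed-effects sketch with a random intercept per product, which pools information across thousands of short sales histories. Column names are placeholders, and whether this, a GAM, or pooled OLS fits your data best is exactly the exploration step described above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily table: date, product_id, units, price, promo.
df = pd.read_csv("daily_sales.csv", parse_dates=["date"])
df["dow"] = df["date"].dt.day_name()

# Random intercept per product; fixed effects for price, promo, day of week.
model = smf.mixedlm("units ~ price + promo + C(dow)", data=df, groups=df["product_id"])
result = model.fit()
print(result.summary())
```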