Forecastability – How to Assess the Forecastability of Time Series Data

Suppose I have a little over 20,000 monthly time series spanning Jan '05 to Dec '11.
Each of these represents global sales data for a different product. What if, instead of computing forecasts for each and every one of them, I wanted to focus only on a small number of products that "actually matter"?

I could rank those products by total annual revenue and trim down the list using a classical Pareto analysis. Still, it seems to me that, although they do not contribute much to the bottom line, some products are so easy to forecast that leaving them out would be bad judgement. A product that sold $50 worth each month for the past 10 years might not sound like much, but it requires so little effort to generate predictions about future sales that I might as well do it.

So let's say I divide my products into four categories: high revenue/easy to forecast – low revenue/easy to forecast – high revenue/hard to forecast – low revenue/hard to forecast.
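The four-way split above could be sketched roughly as follows. This is only an illustration: the function names, the `difficulty` score (any measure where larger means harder to forecast), and both cutoffs are hypothetical and would have to be chosen for the actual data.

```python
def categorize(revenue, difficulty, revenue_cutoff, difficulty_cutoff):
    """Return one of the four revenue/forecastability labels.

    `difficulty` is an assumed score where larger means harder to forecast;
    both cutoffs are hypothetical and must be chosen by the analyst.
    """
    revenue_label = "high revenue" if revenue >= revenue_cutoff else "low revenue"
    effort_label = (
        "easy to forecast" if difficulty <= difficulty_cutoff else "hard to forecast"
    )
    return f"{revenue_label}/{effort_label}"


def keep_for_forecasting(revenue, difficulty, revenue_cutoff, difficulty_cutoff):
    """Keep every product except those in the low revenue/hard to forecast group."""
    label = categorize(revenue, difficulty, revenue_cutoff, difficulty_cutoff)
    return label != "low revenue/hard to forecast"
```

With this rule, a cheap but stable product is retained, while a cheap and erratic one is dropped.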

I think it would be reasonable to leave behind only those time series belonging to the fourth group. But how exactly can I evaluate "forecastability"?

Coefficient of variation seems like a good starting point (I also remember seeing some paper about it a while ago). But what if my time series exhibit seasonality/level shifts/calendar effects/strong trends?
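To make the concern concrete, here is a small sketch of the coefficient of variation (standard deviation divided by mean) applied to two made-up series. The data are invented for illustration: one series is flat, the other has a large December spike that repeats identically every year, so it is just as predictable but has a much larger coefficient of variation.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    level = mean(values)
    if level == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / abs(level)


# Invented example data: two years of monthly sales.
flat = [50] * 24                      # same value every month
seasonal = ([10] * 11 + [130]) * 2    # identical December spike each year

cv_flat = coefficient_of_variation(flat)
cv_seasonal = coefficient_of_variation(seasonal)
```

Both series are trivial to forecast, yet only the flat one looks "easy" when judged by the coefficient of variation of the raw data.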

I would imagine I should base my evaluation only on the variability of the random component rather than that of the "raw" data. Or am I missing something?

Has anybody stumbled upon a similar problem before? How would you guys go about it?

As always, any help is greatly appreciated!

Best Answer

Here's a second idea based on stl.

You could fit an stl decomposition to each series, and then compare the standard error of the remainder component to the mean of the original data, ignoring any partial years. Series that are easy to forecast should have a small ratio of se(remainder) to mean(data).
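A rough, dependency-free sketch of this ratio is shown below. It is not stl: to keep the example self-contained it swaps in a much cruder decomposition, where the seasonal fit is just the average of each calendar month and no trend is estimated. It also uses the sample standard deviation of the remainder as the spread measure. The function name and the `period` parameter are assumptions made for illustration.

```python
from statistics import mean, stdev


def remainder_to_mean_ratio(values, period=12):
    """Spread of the remainder relative to the mean of the data.

    Stand-in for an stl fit: the seasonal component is the average of each
    position in the cycle (all Januaries, all Februaries, ...), with no trend.
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full seasonal cycles")
    seasonal_means = [mean(values[i::period]) for i in range(period)]
    # Remainder = observation minus the average for its position in the cycle.
    remainder = [v - seasonal_means[i % period] for i, v in enumerate(values)]
    level = mean(values)
    if level == 0:
        raise ValueError("ratio is undefined for a zero mean")
    return stdev(remainder) / abs(level)


# Invented data: seven years with an identical December spike each year.
seasonal = ([10] * 11 + [130]) * 7
# Same pattern, but whole years alternate between 5 above and 5 below.
noisy = [v + (5 if (i // 12) % 2 == 0 else -5) for i, v in enumerate(seasonal)]

ratio_seasonal = remainder_to_mean_ratio(seasonal)
ratio_noisy = remainder_to_mean_ratio(noisy)
```

The perfectly repeating series gets a ratio of zero, and the perturbed one gets a positive ratio, so ranking by this number puts the more regular series first. For real data, a proper stl fit (for example `stl()` in R) would replace the per-month averages.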

The reason I suggest ignoring partial years is that seasonality will otherwise affect the mean of the data. In the example in the question, all series have seven complete years, so it is not an issue. But if the series extended part way into 2012, I would compute the mean only up to the end of 2011 to avoid seasonal contamination of the mean.
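Dropping partial years before taking the mean might look like the sketch below. It assumes the series is a plain list of monthly values with no gaps and that the calendar month of the first observation is known; the function name and its arguments are hypothetical.

```python
def complete_years_only(values, start_month=1):
    """Return only the values that fall in complete calendar years.

    `values` is assumed to be a gap-free monthly series whose first
    observation falls in `start_month` (1 = January, ..., 12 = December).
    """
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    # Months to skip so that the retained data starts in January.
    lead = (13 - start_month) % 12
    usable = values[lead:]
    full_years = len(usable) // 12
    return usable[: full_years * 12]


# Jan 2005 to Mar 2012 is 87 months; the three months of 2012 are dropped.
series = list(range(87))
trimmed = complete_years_only(series, start_month=1)
```

The mean would then be computed on `trimmed` rather than on the full series.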

This idea assumes that mean(data) makes sense, that is, that the data are mean-stationary (apart from seasonality). It probably wouldn't work well for data with strong trends or unit roots.

It also assumes that a good stl fit translates into good forecasts, but I can't think of an example where that wouldn't be true, so it is probably an OK assumption.
