Solved – Analysis of time series with many zero values

Tags: correlation, crostons-method, intermittent-time-series, time-series

This problem is actually about fire detection, but it is strongly analogous to some radioactive decay detection problems. The phenomenon being observed is both sporadic and highly variable; thus, a time series consists of long strings of zeroes interrupted by runs of variable values.

The objective is not merely capturing events (breaks in the zeroes), but quantitative characterization of the events themselves. However, the sensors are limited, and thus will sometimes record zero even if the "reality" is non-zero. For this reason, zeroes must be included when comparing sensors.

Sensor B might be more sensitive than Sensor A, and I would like to be able to describe that statistically. For this analysis I do not have "truth," but I do have a Sensor C, which is independent of Sensors A and B. My expectation is therefore that better agreement between A (or B) and C indicates better agreement with "truth." (This may seem shaky, but trust me: based on what is known from other studies of these sensors, I am on solid ground here.)

The problem, then, is how to quantify "better agreement of time series." Correlation is the obvious choice, but it will be affected by all those zeroes (which cannot be left out) and, of course, disproportionately affected by the maximum values. RMSE could also be calculated, but it would be strongly weighted toward the behavior of the sensors in the near-zero case.
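To illustrate the concern about zeroes: two series that merely share event *timing* can show a strong positive Pearson correlation even when their magnitudes disagree as badly as possible. This is a minimal, made-up example (the sensor values are invented for illustration), with magnitudes that are perfectly anti-correlated on the events themselves:

```python
import numpy as np

# 100 periods; both "sensors" fire at the same 4 times, but their
# magnitudes on those events are perfectly ANTI-correlated.
a = np.zeros(100)
b = np.zeros(100)
idx = [10, 30, 50, 70]
a[idx] = [1, 2, 3, 4]
b[idx] = [4, 3, 2, 1]

r_full = np.corrcoef(a, b)[0, 1]              # zeros included
r_events = np.corrcoef(a[idx], b[idx])[0, 1]  # events only

print(round(r_full, 3), round(r_events, 3))   # 0.655 -1.0
```

The shared zeroes drive the full-series correlation up to about +0.66 even though, event by event, the sensors are in perfect disagreement.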

Q1: What is the best way to apply a logarithmic scaling to non-zero values that will then be combined with zeroes in a time-series analysis?
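(One candidate transform I am aware of, noted here only as an illustration and not as a settled answer, is `log1p`, i.e. log(1 + x), which compresses the large values while mapping zero exactly to zero so the zeroes can remain in the series:)

```python
import numpy as np

# Illustrative values only: a sparse series with two events.
x = np.array([0.0, 0.0, 12.5, 0.0, 340.0, 0.0])
y = np.log1p(x)   # log(1 + x): zeroes stay exactly zero
print(y)
```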

Q2: What "best practices" can you recommend for a time-series analysis of this type, where behavior at non-zero values is the focus, but zero values dominate and cannot be excluded?

Best Answer

To restate your question: "How does the analyst deal with long periods of no demand that follow no specific pattern?"

The answer to your question is Intermittent Demand Analysis, also called Sparse Data Analysis. This arises normally when you have "lots of zeros" relative to the number of non-zeros. The issue is that there are two random variables: the time between events and the expected size of an event. As you said, the autocorrelation function (acf) of the complete set of readings is meaningless, because the runs of zeroes falsely enhance the acf.

One thread to pursue is "Croston's method," which is a model-based procedure rather than a data-based procedure. Croston's method is vulnerable to outliers and to changes/trends/level shifts in the rate of demand, i.e. the demand divided by the number of periods since the last demand. A more rigorous approach might be to search for "Sparse Data - Unequally Spaced Data" or similar.

A rather ingenious solution was suggested to me by Prof. Ramesh Sharda of OSU, and I have been using it for a number of years in my consulting practice. If a series has time points where sales arise and long periods where no sales arise, it is possible to convert sales to sales per period by dividing the observed sales by the number of periods of no sales, thus obtaining a rate. It is then possible to identify a model relating the rate to the interval between sales, culminating in a forecasted rate and a forecasted interval. You can find out more about this at autobox.com, and google "intermittent demand".
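Croston's method can be sketched in a few lines: apply simple exponential smoothing separately to the nonzero demand sizes and to the inter-demand intervals, and forecast the ratio of the two (a demand rate per period). This is a minimal sketch of the classic form of the method; the `series` and `alpha` below are illustrative, not taken from the question:

```python
import numpy as np

def croston(demand, alpha=0.1):
    """Classic Croston's method: smooth nonzero demand sizes (z) and
    inter-demand intervals (p) separately; forecast = z / p."""
    demand = np.asarray(demand, dtype=float)
    first = np.flatnonzero(demand)[0]   # first nonzero observation
    z = demand[first]                   # smoothed demand size
    p = 1.0                             # smoothed inter-demand interval
    periods_since = 1
    forecast = np.full(len(demand), np.nan)
    for t in range(first + 1, len(demand)):
        forecast[t] = z / p             # one-step-ahead demand rate
        if demand[t] > 0:
            z = alpha * demand[t] + (1 - alpha) * z
            p = alpha * periods_since + (1 - alpha) * p
            periods_since = 1
        else:
            periods_since += 1
    return forecast

series = [0, 0, 5, 0, 0, 0, 3, 0, 4, 0, 0, 2]
print(croston(series)[-1])
```

Note that the smoothed estimates are updated only on periods with nonzero demand, which is exactly what keeps the long runs of zeroes from dragging the rate estimate toward zero; it is also why the method reacts slowly to level shifts in the demand rate, as mentioned above.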