Solved – Modeling of real-time streaming data

modeling, real-time, software

I am interested in tools/techniques that can be used for analysis of streaming data in "real-time"*, where latency is an issue. The most common example of this is probably price data from a financial market, although it also occurs in other fields (e.g. finding trends on Twitter or in Google searches).

In my experience, the most common software category for this is "complex event processing". This includes commercial software such as Streambase and Aleri or open-source ones such as Esper or Telegraph (which was the basis for Truviso).

Many existing models are not suited to this kind of analysis because they're too computationally expensive. Are any models** specifically designed to deal with real-time data? What tools can be used for this?


* By "real-time", I mean "analysis on data as it is created". So I do not mean "data that has a time-based relevance" (as in this talk by Hilary Mason).

** By "model", I mean a mathematical abstraction that describes the behavior of an object of study (e.g. in terms of random variables and their associated probability distributions), either for description or forecasting. This could be a machine learning or statistical model.

Best Answer

This area roughly falls into two categories. The first concerns stream processing and querying, along with the associated models and algorithms. The second concerns efficient algorithms and models for learning from data streams (also called data stream mining).

It's my impression that the CEP industry is connected to the first area. For example, StreamBase originated from the Aurora project at Brown/Brandeis/MIT. A similar project was Widom's STREAM at Stanford. Reviewing the publications on either project's site should help you explore the area.
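To make the first area a bit more concrete: a continuous query in a stream processing system is typically evaluated over a window of recent events rather than over the full history. Below is a minimal Python sketch of one such query, a moving average over the last `size` events. The class name `SlidingWindowMean` and its interface are made up for illustration; this is not how StreamBase, Esper, or any other engine exposes windows.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the most recent `size` values, updated in O(1) per event.

    Hypothetical helper for illustration only; not the API of any CEP engine.
    """

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        """Add one event and return the mean over the current window."""
        self.window.append(value)
        self.total += value
        # Evict the oldest event once the window is over capacity.
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


if __name__ == "__main__":
    query = SlidingWindowMean(size=3)
    for price in [1.0, 2.0, 3.0, 4.0]:
        print(query.update(price))
```

The point of the design is that each new event costs constant work and memory is bounded by the window size, which is what makes this kind of query feasible when latency matters.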

A nice paper summarizing the research issues (as of 2002) in the first area is "Models and Issues in Data Stream Systems" by Babcock et al. For stream mining, I'd recommend starting with "Mining Data Streams: A Review" by Gaber et al.
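For a feel of what "learning from a stream" means in the simplest case: such algorithms generally have to see each observation once and keep only a small, fixed amount of state. A minimal Python sketch, using Welford's online algorithm for the running mean and variance, is below. It is my own illustration rather than something taken from either paper, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's online algorithm).

    Keeps three numbers of state regardless of how many values arrive.
    """

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean and from the updated mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two observations are seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for tick in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(tick)
    print(stats.mean, stats.variance)
```

More elaborate stream models follow the same pattern: a small summary that is updated incrementally, with no need to store or revisit past data.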

BTW, I'm not sure exactly which kinds of models you're interested in. If it's stream mining, and classification in particular, the VFDT (Very Fast Decision Tree, a Hoeffding-tree learner) is a popular choice. The two review papers mentioned above point to many other models; which one is appropriate depends heavily on the context.
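As a rough illustration of why the VFDT suits streams: it decides when to split a leaf using the Hoeffding bound, so it only needs enough examples to be confident that the best attribute really beats the runner-up. The Python sketch below shows just that test. The function names are hypothetical, it assumes information gain as the split criterion (so the range of the criterion is log2 of the number of classes), and it leaves out the rest of the algorithm, such as maintaining per-leaf counts and the tie-breaking threshold.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)).

    With probability at least 1 - delta, the true mean of a variable with
    range R lies within epsilon of its average over n observations.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n_classes, delta, n):
    """Split when the observed gain gap exceeds the Hoeffding bound.

    Assumes information gain, whose range is log2(n_classes).
    """
    value_range = math.log2(n_classes)
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Same observed gains, different numbers of examples seen at the leaf.
    print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=100))
    print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=1000))
```

Because the bound shrinks as more examples arrive, a leaf with a clear winner splits early, while an ambiguous one simply waits for more data; no example is ever stored or re-read.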