I am working with several types of metrics that characterize different components of an application. They range from system metrics such as cpu.utilization to network and database metrics such as bytes.out/bytes.in, and response-time for apache and haproxy.
The assumption of a normal distribution doesn't seem to hold for these metrics: their distributions depend on the load and are skewed even under an almost constant load. Seasonality may also be present in a few of the metrics.
The objective is to detect, in real time, whether there is a change in the long-term trend or a breakout in the time series of these metrics at a given instant.
What are the best approaches for building a generic breakout-detection system, or do we need different approaches depending on the nature of these metrics?
For breakout detection I am considering a t-test to check whether the mean of the current window differs significantly from that of a previous window, or from the long-term mean.
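The windowed t-test idea can be sketched as below. The window size, significance level, and synthetic data are illustrative assumptions; Welch's variant of the t-test is used because, as noted above, the normality and equal-variance assumptions are shaky for these metrics.

```python
import numpy as np
from scipy import stats

def window_t_test(series, window=50, alpha=0.01):
    """Compare the mean of the most recent window against the
    preceding window using Welch's t-test (unequal variances).
    Returns (breakout_flag, p_value)."""
    current = series[-window:]
    previous = series[-2 * window:-window]
    _, p_value = stats.ttest_ind(current, previous, equal_var=False)
    return p_value < alpha, p_value

# Synthetic example: a +5 level shift in the last 50 points.
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(10, 1, 200), rng.normal(15, 1, 50)])
shifted, p = window_t_test(series)  # shifted is True for this series
```

Note that repeatedly running this test on a sliding window inflates the false-positive rate, so in practice the threshold needs to be stricter than a one-shot alpha would suggest.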
Any guidance on the approaches will be very helpful.
Update: Adding link to a few data sets.
Best Answer
There are several solutions to your problem. Outliers come in two forms:

1. Additive outliers: one-off pulses or spikes.
2. Level shifts: sustained changes in the mean of the series.

I'm assuming you need the second, which is what you call breakout detection. There are a variety of methods and tools that could help you with this:
Open-source software: the changepoint and breakpoint R packages, Netflix's RAD, and Twitter's BreakoutDetection.
Commercial software: there are two commercial options that I have used with great success: 1. SAS, using the UCM and ARIMA frameworks 2. SPSS time-series outlier detection
It is beyond the scope of one answer to cover the pros and cons of these methodologies. I must say that RAD from Netflix and Breakout Detection from Twitter perform worst on your data. In my opinion, this tells you that statisticians have developed elegant methods, like the one in the changepoint package, that can easily detect breakpoints in your data. I have also had excellent success using SAS/SPSS.
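To illustrate the idea behind mean-shift changepoint detection, here is a minimal Python sketch that finds the single split minimizing the within-segment sum of squared errors. This is only the basic cost function; the changepoint package's actual methods (e.g. PELT) handle multiple changepoints efficiently and include a penalty term. The synthetic data and parameters are assumptions for demonstration.

```python
import numpy as np

def single_changepoint(series, min_size=5):
    """Return the index that best splits the series into two
    constant-mean segments, by minimizing the total within-segment
    sum of squared errors (the cost for a mean-shift model)."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    best_tau, best_cost = None, np.inf
    for tau in range(min_size, n - min_size):
        left, right = x[:tau], x[tau:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

# Synthetic series with a level shift at index 100.
rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
tau = single_changepoint(series)  # estimated changepoint, near 100
```

This exhaustive search is O(n^2) in this naive form; PELT-style pruning is what makes the same idea practical on long series.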
Below are some results from applying all four open-source packages. Twitter's BreakoutDetection is the worst: it does not recognize any breaks in your data. Netflix's RAD does flag all your additive outliers/pulses but fails to recognize the level shift around data point ~1351. Both changepoint and breakpoint correctly detect the level shift, at ~1351 and ~1353 respectively. I'll expand my answer in the future. Let us know if this is what you are looking for.
Output from changepoint and breakpoint:
Output from RAD (Netflix) and Breakout Detection (Twitter), both of which fail to recognize the breakouts:
Twitter's Breakout detection: