How to identify outliers in server uptime performance data

outliers, quantiles

I have a Python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example, nicely formatted, it looks something like this:

-------  -------------  ------------  ----------  -------------------
Cluster  %Availability  Requests/Sec  Errors/Sec  %Memory_Utilization
-------  -------------  ------------  ----------  -------------------
ams-a    98.099          1012         678          91
bos-a    98.099          1111         12           91
bos-b    55.123          1513         576          22
lax-a    99.110          988          10           89
pdx-a    98.123          1121         11           90
ord-b    75.005          1301         123          100
sjc-a    99.020          1000         10           88
...(so on)...

So in list form, it might look like:

[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]

My question:

  • What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'?

In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Since higher is not necessarily worse in every column, nor is lower, I'm trying to figure out the most efficient way to do this. It seems like numpy gets mentioned a lot for this sort of thing, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician…). When I asked over at Stack Overflow, someone mentioned using scipy's scoreatpercentile function and throwing out anything over the 99th percentile. Does that seem like a good idea?

(Cross-posted from stackoverflow, here: https://stackoverflow.com/questions/4606288)

Best Answer

Based on the way you phrase the question,

"are outliers not necessarily the best way to attack the problem of finding 'badness'?"

it is not clear that you are looking for outliers. For example, it seems that you are interested in machines performing above or below some threshold.

As an example, if all of your servers were at $98 \pm 0.1$% availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But both may be within your desired limits.
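To make that concrete, here is a minimal sketch using made-up availability numbers (not your data). It flags outliers with a modified z-score based on the median absolute deviation, which is just one common rule of thumb and not necessarily the right detector for you. The 3.5 cutoff and the 95% requirement are assumptions chosen for illustration.

```python
from statistics import median

# Hypothetical availability figures (%): most sit near 98, plus the two
# unusual servers from the example (100% and 97.6%).
availability = [98.0, 97.9, 98.1, 98.05, 97.95, 98.0, 100.0, 97.6]


def mad_outliers(values, cutoff=3.5):
    """Return the values whose modified z-score exceeds the cutoff."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return [v for v in values if 0.6745 * abs(v - center) / mad > cutoff]


outliers = mad_outliers(availability)

# Assumed operational requirement: at least 95% availability.
below_requirement = [v for v in availability if v < 95.0]

print("statistical outliers:", outliers)
print("below the 95% requirement:", below_requirement)
```

With these numbers both unusual servers are flagged as outliers, yet none falls below the assumed requirement, so an outlier report here would only generate noise.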

On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether one or many servers fall below this threshold.
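A threshold check of this kind could look something like the sketch below, using the rows from your question. The limits (95% availability, 100 errors/sec, 95% memory) are placeholders I made up; substitute whatever your performance requirements dictate. Each rule carries its own comparison, which handles the fact that "bad" means low in some columns and high in others.

```python
import operator

COLUMNS = ["Cluster", "%Availability", "Requests/Sec",
           "Errors/Sec", "%Memory_Utilization"]

rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]

# Hypothetical limits: (column name, comparison that means "bad", limit).
RULES = [
    ("%Availability", operator.lt, 95.0),      # too low is bad
    ("Errors/Sec", operator.gt, 100),          # too high is bad
    ("%Memory_Utilization", operator.gt, 95),  # too high is bad
]


def find_violations(rows, rules=RULES):
    """Map each offending cluster name to the columns that broke a rule."""
    violations = {}
    for row in rows:
        broken = [name for name, is_bad, limit in rules
                  if is_bad(row[COLUMNS.index(name)], limit)]
        if broken:
            violations[row[0]] = broken
    return violations


violations = find_violations(rows)
for cluster, broken in violations.items():
    print(cluster, "->", ", ".join(broken))
```

With these placeholder limits the check reports ams-a, bos-b and ord-b, which are the three clusters you said you wanted to hear about.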

For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling error rate as a Poisson variable or percent availability as a Beta variable. In an applied setting, these thresholds could probably be determined from performance requirements.
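As a rough sketch of the statistical route, suppose you had historical error counts per second for a healthy cluster (the `history` list below is invented for illustration). Assuming the counts are roughly Poisson, you could estimate the rate as the sample mean and alert on anything above a high quantile. The 0.999 level is an arbitrary choice, and whether a Poisson model really fits your error data is something you would need to check.

```python
import math
from statistics import mean


def poisson_quantile(lam, p):
    """Smallest k such that P(X <= k) >= p for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    while cdf < p:
        k += 1
        pmf *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


# Hypothetical errors/sec observed on a healthy cluster.
history = [10, 12, 11, 9, 13, 10, 12, 11]

rate = mean(history)                       # Poisson rate estimate
threshold = poisson_quantile(rate, 0.999)  # alert above this count

for cluster, errors in [("bos-a", 12), ("ord-b", 123), ("ams-a", 678)]:
    status = "ALERT" if errors > threshold else "ok"
    print(cluster, errors, status)
```

The same idea would apply to availability with a Beta model, but that needs a fitting step (and in practice a library such as scipy), so it is left out here.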
