Solved – Good form to remove outliers

Tags: mean, outliers, robust

I'm working on statistics for software builds. For each build I have pass/fail and elapsed-time data, and we generate ~200 builds per week.

The success rate is easy to aggregate: I can say that 45% of builds passed in any given week. But I'd like to aggregate elapsed time as well, and I want to make sure I don't misrepresent the data too badly. Figured I'd better ask the pros 🙂

Say I have 10 durations. They represent both pass and fail cases. Some builds fail immediately, which makes duration unusually short. Some hang during testing and eventually time out, causing very long durations. We build different products, so even successful builds vary between 90 seconds and 4 hours.

I might get a set like this:

[50, 7812, 3014, 13400, 21011, 155, 60, 8993, 8378, 9100]

My first approach was to get the median time by sorting the set and picking the mid-value, in this case 7812 (I didn't bother averaging the two middle values for even-sized sets).

Unfortunately, this seems to generate a lot of variation, since I only pick out one value. So if I were to trend this value, it would bounce around between 5,000 and 10,000 seconds depending on which build happened to sit at the median.
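For reference, the "sort and take the mid-value" median described above can be reproduced with Python's standard library (using the sample set from this question):

```python
import statistics

durations = [50, 7812, 3014, 13400, 21011, 155, 60, 8993, 8378, 9100]

# median_low returns the lower of the two middle values for an
# even-sized set, matching "sort and pick the mid-value" here.
print(statistics.median_low(durations))  # -> 7812
```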

So to smooth this out, I tried another approach — remove outliers and then calculate a mean over the remaining values. I decided to split it into tertiles and work only on the middle one:

[50, 60, 155, 3014, 7812, 8378, 8993, 9100, 13400, 21011] ->
[50, 60, 155], [3014, 7812, 8378, 8993], [9100, 13400, 21011] ->
[3014, 7812, 8378, 8993]
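A minimal sketch of that tertile trim-then-mean computation (Python; the `n // 3` cut reproduces the 3/4/3 split shown above for ten values):

```python
def middle_tertile_mean(durations):
    """Sort, drop the lowest and highest thirds, and average the rest."""
    data = sorted(durations)
    n = len(data)
    cut = n // 3  # size of each outer tertile (3 each for n = 10)
    middle = data[cut:n - cut]
    return sum(middle) / len(middle)

durations = [50, 7812, 3014, 13400, 21011, 155, 60, 8993, 8378, 9100]
print(middle_tertile_mean(durations))  # -> 7049.25
```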

The reason this seems better to me is two-fold:

  • We don't want any action on the faster builds, they're already fine
  • The longest builds are likely timeout-induced, and will always be there. We have other mechanisms to detect those

So it seems to me that this is the data I'm looking for, but I'm worried that I've achieved smoothness by removing, well, truth.

Is this controversial? Is the method sane?

Thanks!

Best Answer

Your approach makes sense to me, given your goal. It's simple, it's straightforward, it gets the job done, and you likely don't want to write a scientific paper about it.

One thing you should always do when dealing with outliers is understand them, and you already do a good job of that. So one possible improvement: can you use the information about which builds are hanging? You mention that you have "other mechanisms to detect those" - can you identify those builds and remove only them from the sample?
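As a sketch of that idea: if each build record carried a status label from the timeout detector (the field names and labels here are hypothetical assumptions, not from the question), you could drop only the flagged builds before averaging, instead of trimming blindly by rank:

```python
# Hypothetical build records: (duration_seconds, status).
# The status labels are illustrative, not from the question.
builds = [
    (50, "failed"), (7812, "passed"), (3014, "passed"),
    (13400, "passed"), (21011, "timeout"), (155, "failed"),
    (60, "failed"), (8993, "passed"), (8378, "passed"), (9100, "passed"),
]

# Keep everything except builds the timeout detector flagged.
kept = [duration for duration, status in builds if status != "timeout"]
print(round(sum(kept) / len(kept), 2))  # -> 5662.44
```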

Otherwise, if you have more data, you could trim by quintiles rather than tertiles, i.e., cut a smaller fraction from each end... but at some point, this will likely not make much of a difference.
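Trimming generalizes easily to any fraction. A sketch in pure Python (with the question's sample data, a trim of 1/3 reproduces the tertile result above, and 1/5 is the milder quintile trim):

```python
def trimmed_mean(values, trim):
    """Mean after cutting the lowest and highest `trim` fraction of values."""
    data = sorted(values)
    cut = int(trim * len(data))
    kept = data[cut:len(data) - cut] if cut else data
    return sum(kept) / len(kept)

durations = [50, 7812, 3014, 13400, 21011, 155, 60, 8993, 8378, 9100]
print(trimmed_mean(durations, 1/3))  # -> 7049.25 (same as the tertile approach)
print(trimmed_mean(durations, 1/5))  # -> 6242.0  (milder quintile trim)
```

If SciPy is available, `scipy.stats.trim_mean(durations, 1/3)` computes the same quantity.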
