Solved – Outlier detection in very small sets

Tags: algorithms, classification, outliers

I need to get as accurate a value as possible for the brightness of a mostly stable light source, given twelve sample luminosity values. The sensor is imperfect, and the light can occasionally "flicker" brighter or darker; that flicker can be ignored, hence my need for outlier detection (I think?).

I've done some reading up on various approaches and can't decide which one to go for. The number of outliers is never known in advance and will often be zero. Flicker is generally a very large deviation from the stable brightness (enough to really mess with any average computed while one is present), but not necessarily so.

Here's a sample set of 12 measurements for completeness of the question:

295.5214, 277.7749, 274.6538, 272.5897, 271.0733, 292.5856, 282.0986, 275.0419, 273.084, 273.1783, 274.0317, 290.1837

My gut feeling is there are probably no outliers in that particular set, although 292 and 295 look a little high.

So, my question is: what would be the best approach here? I should mention that the values come from taking the Euclidean distance of the R, G, and B components of the light from a zero (black) point. It would be programmatically painful, but possible, to get back to those component values if required. The Euclidean distance was used as a measure of "overall strength" because I'm not interested in the color, just the strength of the output. However, there's a reasonable chance that the flickers I mentioned have a different RGB composition from the usual output.
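
For clarity, each reading is essentially the vector magnitude of the RGB triple, i.e. something like the sketch below (the function name is just for illustration):

```python
import math

def overall_strength(r, g, b):
    # Euclidean distance of the (R, G, B) point from black, (0, 0, 0)
    return math.sqrt(r * r + g * g + b * b)
```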

At the moment I am toying with some sort of function that will repeat until a stable membership of allowed measures is reached by:

  1. Finding the standard deviation
  2. Putting everything outside, say, 2 SDs into an ignore list
  3. Recalculating the average and SD with the ignore list excluded
  4. Re-deciding who to ignore based on the new average and SD (assess all 12)
  5. Repeat until stable.
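
Roughly, in code, I mean something like this (just a sketch of the idea; the names are mine, not from any real implementation):

```python
import statistics

def stable_filter(values, k=2.0, max_iter=20):
    """Repeatedly drop anything further than k standard deviations from the
    mean of the currently kept points, re-assessing all points each pass,
    until the kept set stops changing."""
    kept = set(range(len(values)))
    for _ in range(max_iter):
        sample = [values[i] for i in kept]
        if len(sample) < 3:               # too few points left to be meaningful
            break
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        new_kept = {i for i, v in enumerate(values) if abs(v - mean) <= k * sd}
        if new_kept == kept:              # membership is stable -> stop
            break
        kept = new_kept
    return [values[i] for i in sorted(kept)]

readings = [295.5214, 277.7749, 274.6538, 272.5897, 271.0733, 292.5856,
            282.0986, 275.0419, 273.084, 273.1783, 274.0317, 290.1837]
print(stable_filter(readings, k=2.0))
```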

Is there any value in that approach?

All comments gratefully accepted!

Best Answer

Outliers in small samples can always be very tricky to detect. In most cases I would actually argue that if you feel your data are not blatantly corrupted, an "outlierish" value might not be problematic and excluding it might be unreasonable. Using robust statistical techniques will probably be more sensible and closer to a middle-ground solution. You have a small sample; try to make every sample point count. :)

Regarding your suggested approach: I would not hastily impose a normality assumption on your data via the 68-95-99.7 rule (as you implicitly do with your 2-SD heuristic). Chebyshev's inequality, for instance, gives a distribution-free 75-88.9-93.8 rule, which is clearly less strict. Other "rules" also exist; the Definition and detection section of the Outlier article on Wikipedia has a bundle of heuristics.
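
(Concretely, Chebyshev's inequality says that for any distribution with finite mean $\mu$ and standard deviation $\sigma$, $\Pr(|X-\mu| \ge k\sigma) \le 1/k^2$; plugging in $k = 2, 3, 4$ gives the at-least 75%, 88.9%, and 93.8% figures above.)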

Here is another one: a free reference I have come across on the matter, the NIST/SEMATECH e-Handbook of Statistical Methods, presents the following idea by Iglewicz and Hoaglin (1993): use modified $Z$-scores $M$ such that

$M_i = \dfrac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}}$

where $\tilde{x}$ is your sample median and $\mathrm{MAD}$ is the median absolute deviation of your sample. Then treat absolute values of $M$ above 3.5 as potential outliers. It is a semi-parametric suggestion (as most of them are; the parameter here being the $3.5$). In your example it would marginally exclude your 295.5 but clearly retain your 292.6 measurement... (For what it's worth, I wouldn't exclude any values from your example.)
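
If it helps, here is a quick sketch of that computation in Python on your twelve readings (plain NumPy; be aware that ready-made MAD functions, e.g. R's mad(), often pre-multiply by the consistency factor 1.4826 ≈ 1/0.6745, in which case the 0.6745 in the formula should be dropped, and the exact $M$ values, and hence which points cross 3.5, shift accordingly):

```python
import numpy as np

readings = np.array([295.5214, 277.7749, 274.6538, 272.5897, 271.0733, 292.5856,
                     282.0986, 275.0419, 273.084, 273.1783, 274.0317, 290.1837])

median = np.median(readings)                   # the sample median, x-tilde
mad = np.median(np.abs(readings - median))     # raw median absolute deviation
m_scores = 0.6745 * (readings - median) / mad  # modified Z-scores M_i

# Points with |M_i| > 3.5 are the potential outliers under this rule.
for value, m in zip(readings, m_scores):
    flag = "  <-- potential outlier" if abs(m) > 3.5 else ""
    print(f"{value:9.4f}   M = {m:6.2f}{flag}")
```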

Again, given that you have a really small sample, if you believe your data are not obviously corrupted (say, a recorded human height of 9'4"), I would advise you not to exclude data hastily. Your "suspected outliers" might be uncorrupted data; keeping them could actually help rather than harm your analysis.
