Solved – Identifying subsets for outlier detection in local outlier factor

clustering, density function, machine learning, outliers

I am trying to gain a better understanding of the idea of local outliers (as discussed in this pdf) and how the function is implemented. Here are the key passages from the pdf:

  • Local outliers: Outliers compared to their local neighborhoods, instead of the global data distribution
  • In the figure, o1 and o2 are local outliers with respect to C1, o3 is a global outlier, and o4 is not an outlier. However, proximity-based clustering cannot identify o1 and o2 as outliers (e.g., when compared with o4).
  • Intuition (density-based outlier detection): The density around an outlier object is significantly different from the density around its neighbors

And here is the figure referenced:

[Figure from the pdf; image not preserved]


Specific questions:

Say the overall set of data points in the figure (which contains c1, c2, c3) is:

SP : {P1,P2,P3,P4,P5,P6,P7,P8}
  1. Should the set SP be subdivided into further sets in order to find outliers?

  2. From the image above, it seems that it should?

  3. What determines whether the set of points should be divided into subsets, with LOF applied to each subset instead of to the overall set?

  4. In the figure/pdf, 4 local outliers are produced, but are these outliers of a single set, or is each one just an outlier of a subset of the overall set?

  5. Perhaps each reachability density corresponds to a specific subset of the items?

Best Answer

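LOF uses k-distance neighborhoods: each point is compared only with its own k nearest neighbors, so the data set is not divided into subsets beforehand.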
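To make the k-distance idea concrete, here is a minimal pure-Python sketch of the LOF computation. It is a simplification, not a reference implementation: it takes exactly k neighbors per point (the original definition also keeps ties at the k-distance) and assumes there are no duplicate points. The function names, the toy data, and the choice k=3 are all made up for illustration.

```python
import math

def k_nearest(points, i, k):
    """Return the k nearest neighbors of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Compute a LOF score for every point, using the whole data set at once."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nb[-1][0] for nb in neighbors]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability
        # distance from point i to its neighbors, where
        # reach_dist(i, j) = max(k_dist[j], dist(i, j)).
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(density[j] for _, j in neighbors[i]) / (k * density[i])
        for i in range(n)
    ]

# Toy data (hypothetical): a dense cluster, a sparse cluster, and a point o1
# that sits close to the dense cluster but outside it.
dense = [(0.1 * x, 0.1 * y) for x in range(3) for y in range(3)]
sparse = [(10.0 + 2.0 * x, 10.0 + 2.0 * y) for x in range(3) for y in range(3)]
o1 = (0.7, 0.1)
points = dense + sparse + [o1]

scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

In this toy data, o1 is closer to its nearest neighbor (0.5) than the sparse-cluster points are to theirs (2.0), so a single global distance threshold would not single it out. LOF still gives it the largest score, because it is compared only with the dense points next to it. No subsets are passed in; the neighborhoods play that role for each point individually.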

Doing clustering to detect outliers has been attempted several times, but none of these methods seems to be very popular; definitely not as popular as LOF.

The reason may be that outliers can make clustering harder; so you may also want to do the opposite: first remove local outliers, then cluster.

The clustering method DBSCAN also has a notion of "noise", so it also "detects" density outliers. But then it does not make sense to run LOF afterwards, when DBSCAN has already flagged "outliers". Other clustering methods such as k-means are sensitive to outliers, so you can't use them well either.
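As a rough illustration of the "noise" notion, here is a small pure-Python sketch of DBSCAN. It is a simplified version written for this answer, with brute-force neighbor search; the parameter values `eps=1.5` and `min_pts=4` and the toy data are arbitrary assumptions.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region(i):
        # Indices of all points within eps of points[i], including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Toy data (hypothetical): one 3x3 grid cluster and one isolated point.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(10, 10)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

The isolated point ends up with the label -1, which is the sense in which DBSCAN "detects" outliers. Note that this is a yes/no flag based on one global `eps`, not a per-point score like LOF.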

It seems that LOF and similar methods yield higher-quality outlier detections than clustering-based methods.