Solved – Choosing a k-value for Local Outlier Factor (LOF) detection analysis

data miningoutliers

I have a set of three-dimensional data, and I'm trying to use Local Outlier Factor analysis to identify the most unique or strange values. How does one decide the k-value to use in LOF analysis? I understand what the k-value determines, and so I'm not surprised that I'm seeing slightly different results using different k's, but I'm not sure if there are characteristics of my dataset that should push me toward one value over others. Thanks!

Best Answer

Posting this here for anyone who comes across my question in the future -- the original paper describing the local outlier factor algorithm, "LOF: Identifying Density-Based Local Outliers" (Breunig et al), recommends a method of choosing a k-value. As a reminder, the LOF algorithm compares the density of each point to the density of its $k$-closest neighbors. The authors of the paper recommend choosing a minimum $k$ and a maximum $k$, and for each point, taking the maximum LOF value over each $k$ in that range. They offer several guidelines for choosing the bounds.

For the minimum value, the LOF values fluctuate wildy the points in a uniform distribution for $k<10$, with points in a uniform distribution sometimes showing up as outliers, so they recommend at least $min(k)=10$. Secondly, the minimum $k$-value serves as a minimum size for something to be considered a "cluster", so that points can be outliers relative to that cluster. If $k=15$, and you have a group of $12$ points and a point $p$, each point in the group will include $p$ in its nearest neighbors, and $p$ will include those points, leading them to have very similar LOFs. So if you want to consider a point near a group of $N$ points as an outlier, rather than part of that group, your k value should be at least $N$.

For the maximum value, a similar criteria applies, in that it should be the maximum number of objects that you want to be considered outliers if clustered together. A group of $N$ objects isolated from the main set can either be a cluster, or $N$ outliers; for $k<N$, they will be the first; for $k>N$, they will be the second.

Hopefully this helps anyone with a similar problem. The full paper is here, and the discussion of max/min k-values begins on page 7 and goes through page 9. (They refer to the $k$-value as MinPts.)