Solved – Difference between Outlier and Inlier

anomaly-detection, outliers, residuals

I have stumbled upon the term inlier in the context of the LOF measure (Local Outlier Factor). I am familiar with the term outlier (basically, instances that do not behave like the rest of the instances).

What does 'inlier' mean in the context of anomaly detection, and how is it related to (and different from) an outlier?

Best Answer

This is an area where there is a bit of inconsistency in terminology, which has the unfortunate effect of confusing some statistical discussions. The concept of an "inlier" is generally used to refer to a data value that is in error (i.e., subject to measurement error) but is nonetheless in the "interior" of the distribution of the correctly measured values. By this definition the inlier has two aspects: (1) it is in the interior of the relevant distribution of values; and (2) it is an erroneous value. Conversely, the corresponding notion of an "outlier" is usually used to refer to any data value that is far into the tails of the distribution, but without any definitional aspect assuming that it is in error. This terminology yields an unfortunate inconsistency, where an "inlier" is an erroneous data point (by definition) but an "outlier" is not necessarily an erroneous data point. Hence, under this terminology, the union of "inliers" and "outliers" corresponds neither to all of the data nor to all of the erroneous data.
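To make the terminology concrete, here is a small numpy sketch (the dataset and values are my own illustration, not from the answer): one value is known to be erroneously recorded but happens to land near the centre of the distribution (an "inlier"), while another value is extreme but correctly measured (an "outlier"). A naive 3-sigma rule flags the legitimate extreme point and misses the erroneous central one.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=200)

# Hypothetical scenario: one value was recorded wrongly, but the wrong
# value happens to sit near the centre of the distribution -> an "inlier".
erroneous_inlier = 51.0
# Another value is extreme but correctly measured -> an "outlier" that is not an error.
legitimate_outlier = 75.0

sample = np.append(data, [erroneous_inlier, legitimate_outlier])

# A naive outlier rule: flag points more than 3 standard deviations from the mean.
z = (sample - sample.mean()) / sample.std()
flagged = np.abs(z) > 3

print(flagged[-2])  # erroneous inlier: not flagged
print(flagged[-1])  # legitimate outlier: flagged
```

The rule's verdicts track distance from the centre, not correctness, which is exactly the mismatch the terminology describes.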

Dealing with outliers: I have discussed dealing with outliers in other questions here and here, but for convenience, I will repeat some of those remarks here. Outliers are points that are distant from the bulk of other points in a distribution, and diagnosis of an "outlier" is done by comparison of the data point to some assumed distributional form. Although outliers can occasionally be caused by measurement error, diagnosis of outliers can also occur when the data follows a distribution with high kurtosis (i.e., fat tails), but the analyst compares the data points to an assumed distributional form with low kurtosis (e.g., the normal distribution).
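The kurtosis point is easy to demonstrate with a quick simulation (a sketch of my own, not part of the original answer): every point below is a correct draw from a fat-tailed Student-t distribution, yet a normal-theory "3 sigma" rule flags far more points than the 0.27% it predicts.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw from a Student-t distribution with 3 degrees of freedom:
# heavy tails, but no measurement error anywhere in the data.
n = 100_000
t_sample = rng.standard_t(df=3, size=n)

# Under a normal assumption, about 0.27% of points lie beyond 3 standard deviations.
normal_expected = 0.0027
observed = np.mean(np.abs(t_sample) > 3 * t_sample.std())

print(f"expected under normality: {normal_expected:.4f}")
print(f"observed fraction:        {observed:.4f}")
```

The excess of flagged points here reflects only the mismatch between the assumed (normal) and true (fat-tailed) distributional forms.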

Flagging of "outliers" in outlier tests really just means that the model distribution you are using does not have fat enough tails to accurately represent the observed data. This could be because some of the data contains measurement error, or it could just be from a distribution with fat tails. Unless there is some reason to think that deviation from the assumed model form constitutes evidence of measurement error (which would require a theoretical basis for the distributional assumption), the presence of outliers generally means that you should change your model to use a distribution with fatter tails. It is inherently difficult to distinguish between measurement error and high kurtosis that is part of the underlying distribution.

Dealing with inliers (which generally involves not dealing with them): Unless you have a source of external information indicating measurement error, it is essentially impossible to identify "inliers". By definition, these are data points that are in the "interior" of the distribution, where most of the other data occurs. Hence, they are not detected by tests that look for data that is an "aberration" from the other data points. (In some cases you can detect "inliers" that seem to be in the interior of a distribution, but are actually "outliers" when taken with respect to a more complex representation of the distribution. In this case the point is actually an outlier, but it only looks like it is in the interior of the distribution when you are using a crude distributional approximation.)
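The parenthetical point — a point that looks interior under a crude approximation but is an outlier under a richer one — is the situation where local methods like the LOF from the question earn their keep. A hedged numpy sketch (data and thresholds are my own): with two tight clusters, a point midway between them has a near-zero z-score under a single-Gaussian summary, yet a nearest-neighbour distance criterion (the idea underlying LOF) shows it is locally isolated.

```python
import numpy as np

rng = np.random.default_rng(3)

# Bimodal data: two tight clusters, plus one isolated point between them.
cluster_a = rng.normal(-5.0, 0.3, size=100)
cluster_b = rng.normal(5.0, 0.3, size=100)
suspect = 0.0
sample = np.concatenate([cluster_a, cluster_b, [suspect]])

# Crude unimodal view: the suspect's z-score is tiny -> looks like an interior point.
z_suspect = abs(suspect - sample.mean()) / sample.std()

# Richer local view (the idea behind LOF): distance to the k-th nearest neighbour.
k = 5
dists = np.abs(sample[:, None] - sample[None, :])  # pairwise distances
knn_dist = np.sort(dists, axis=1)[:, k]            # column 0 is the zero self-distance
ratio = knn_dist[-1] / np.median(knn_dist)         # suspect vs typical point

print(f"z-score of suspect point: {z_suspect:.2f}")  # near 0: interior of crude model
print(f"kNN-distance ratio:       {ratio:.0f}")      # large: locally isolated
```

Under the crude single-Gaussian view the point is an apparent "inlier"; under the local-density view it is revealed as an outlier, exactly as described above.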

In some rare cases you might have an external source of information that identifies a subset of your data as being subject to measurement error (e.g., if you are conducting a large survey and you find out that one of your surveyors was just making up their data). In this case, any data points in that subset that are in the interior of the distribution are "inliers": they are known, via external information, to be subject to measurement error. You would then generally remove all the data known to be erroneous, even if some of those points are "inliers" sitting in the interior of the distribution, right where you would expect them to be. The point here is that a data point can be erroneous even if it is not in the tails of the distribution.