Solved – When does it make sense to detect multivariate outliers instead of univariate ones

anomaly-detection, outliers

I do get the idea of univariate outliers and detecting them. However, I don't understand the idea of multivariate outliers.

More precisely, I would like to ask if detecting multivariate outliers only makes sense if there is a correlation between the variables and the relationship is linear? Other than that, why would we try to detect multivariate outliers?

When does it make sense to detect multivariate outliers instead of univariate ones?

Best Answer

A short answer is "Usually". And a question in turn: what do you want to do with your data? In most applications it will not be just looking at the data one variable at a time, but looking at variables in pairs, in threes, or all at once. So what matters most of all is whether you have outliers in the variable space you will be working in. If, say, that means analysing an outcome together with several predictors, those variables define the space of concern.

[Figure: two bivariate scatter plots, left-hand and right-hand panels, discussed below.]

Consider these two bivariate examples to see the main point. In the left-hand panel, one data point appears to be an outlier on both variables. Is it a problem for, say, correlation and regression? Not necessarily, as it is entirely consistent with a strong (in this case, exact) linear relationship between the variables. It would be prudent to think about it, and there might be grounds for distrust, but in general a univariate outlier need not be a problem for bivariate analysis.

In the right-hand panel, there is an extra data point. It would often not be regarded as a univariate outlier on either variable: on each variable there is a value much bigger, which makes it appear quite expected. But it is more awkward for bivariate analysis than any other point. In general, a bivariate outlier -- or at least a point behaving differently from the majority -- need not also be a univariate outlier.
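To make that concrete, here is a minimal sketch (my illustration, not part of the original answer): a point that looks ordinary on each variable separately, but stands out once the correlation between the variables is taken into account, as measured by its Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bulk of the data: y is almost exactly x, so the points hug a line
x = rng.normal(0, 1, 200)
y = x + rng.normal(0, 0.1, 200)

# Extra point: middling on each axis, but well off the y = x line
data = np.vstack([np.column_stack([x, y]), [1.5, -1.5]])

# Univariate view: robust z-scores; the extra point looks unremarkable
med = np.median(data, axis=0)
mad = np.median(np.abs(data - med), axis=0)
robust_z = np.abs(data - med) / (1.4826 * mad)

# Bivariate view: squared Mahalanobis distance from the sample mean
diff = data - data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

print(robust_z[-1])   # modest on both variables
print(np.argmax(d2))  # the extra point dominates the bivariate measure
```

A univariate screen (robust z-scores here) never flags the added point, while the Mahalanobis distance, which accounts for the covariance, makes it by far the most extreme observation.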

It's harder to illustrate, but easy to imagine, that say in 3 dimensions, there can be points that are awkward but not univariate outliers for any variable and bivariate outliers only for some two-dimensional projections of the data. The real outliers, if they exist, can be hard to spot. And it gets worse with more dimensions, which is precisely why we have a menagerie of multivariate methods.
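A sketch of that situation (my own construction, not from the answer): points concentrated near the plane z = x + y, plus one point far off that plane which looks unremarkable in every one- and two-dimensional projection. Only the full three-dimensional Mahalanobis distance exposes it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Bulk: x and y vary freely, z is pinned tightly to the plane z = x + y
x = rng.uniform(-3, 3, 300)
y = rng.uniform(-3, 3, 300)
z = x + y + rng.normal(0, 0.05, 300)

# Added point (index 300): each coordinate is modest, each pair is
# plausible, but z should be about 2 given x and y, not -2
data = np.vstack([np.column_stack([x, y, z]), [1.0, 1.0, -2.0]])

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each row from the sample mean."""
    diff = points - points.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(points, rowvar=False))
    return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# In every 2-D projection the added point is not the most extreme one ...
for pair in [(0, 1), (0, 2), (1, 2)]:
    print(pair, np.argmax(mahalanobis_sq(data[:, pair])))

# ... but in the full 3-D space it is far and away the most extreme point
print(np.argmax(mahalanobis_sq(data)))
```

The point deviates only in the direction perpendicular to the plane, a direction that no single projection isolates; with more dimensions there are more such hiding places.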

Emphasis on univariate outliers seems to follow from a mix of good and not so good reasons:

  • It is a good idea to look for univariate outliers, as many outliers can be found that way, and subject-matter knowledge can be applied most easily. A blood pressure of 0 or a human age of 999 years must be wrong. Or, if the data point appears to be plausibly extreme, say data for the Amazon (or Amazon), you have found it and are forewarned. (Most often, working with logarithms is the simplest and best way forward.)

  • It is often featured in introductory texts and courses and many people remember much of what they learned, which is good.

  • Scientists and social scientists and practitioners in many fields often pick up ideas from practice in their fields rather than from their training, perhaps many years before. Practice in many fields doesn't include much beyond univariate and bivariate techniques.

  • It is (much) harder to find bivariate and multivariate outliers in a truly general way. If your data are mostly a banana configuration, but there is an anomalous small orange configuration floating nearby, and that's all in a high-dimensional space, you might need really good methods or long and diligent scrutiny to spot that. Linked to this is that there are many problems, for datasets that may include outliers, that don't yield to linear models or for which other approaches may appear useful.

  • There is a deep-seated myth that data must be normally distributed, which leads to some over-emphasis on checking for that, most often just one variable at a time.
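The banana-and-orange configuration mentioned above can be sketched as follows (my illustration with made-up data, not from the answer): a density-based method such as DBSCAN separates a small anomalous cluster from a curved main cloud, a shape for which any method assuming an elliptical bulk would struggle.

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumes scikit-learn is available

rng = np.random.default_rng(2)

# "Banana": points along a parabolic arc with a little scatter
t = np.linspace(-2, 2, 300)
banana = np.column_stack([t, t ** 2]) + rng.normal(0, 0.05, (300, 2))

# Small "orange": a tight anomalous cluster floating near the arc
orange = rng.normal(loc=[3.0, 1.0], scale=0.05, size=(10, 2))

data = np.vstack([banana, orange])

# Points within eps of enough neighbours are chained into clusters,
# so the curved banana and the compact orange come out separately
labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(data)

print(sorted(set(labels)))
print({lab: int((labels == lab).sum()) for lab in set(labels)})
```

Because DBSCAN follows local density rather than a global elliptical model, the orange points emerge as their own small group (identifiable by its size) instead of being absorbed into, or masked by, the banana.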
