Solved – Removal of multi-dimensional outliers

categorical data, outliers

I have some issues in data reduction, and one expert advised me to remove the outliers and then move to Factor Analysis.

I want to detect outliers across all items jointly: I have 61 items, and box plots are not helpful because they only indicate outliers item by item. How can I detect the outliers all at once?

Best Answer

To make a long story short, you should use a tool such as robust PCA. I may come back to this with a more substantive post, but the short version is explained in this post.
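One related way to screen all items at once, sketched here under assumptions, is a robust Mahalanobis distance: estimate a robust center and covariance, then flag rows whose distance from that center is large. This is not robust PCA itself but a swapped-in robust-covariance technique from the same family. The simulated data, the planted rows, and the 97.5% chi-square cutoff are all illustrative assumptions, and the approach treats the items as roughly continuous, which may not suit strictly categorical responses.

```r
library(MASS)  # provides cov.rob(); MASS ships with standard R installations

set.seed(1)

# Hypothetical data: 200 ordinary rows plus 5 planted extreme rows, 5 variables
n <- 200
p <- 5
clean   <- matrix(rnorm(n * p), nrow = n, ncol = p)
planted <- matrix(rnorm(5 * p, mean = 8), nrow = 5, ncol = p)
x <- rbind(clean, planted)

# Robust center and covariance via the minimum covariance determinant (MCD)
fit <- cov.rob(x, method = "mcd")

# Squared robust distances of every row from the robust center
d2 <- mahalanobis(x, center = fit$center, cov = fit$cov)

# Assumed cutoff: 97.5% quantile of a chi-square with p degrees of freedom
cutoff  <- qchisq(0.975, df = ncol(x))
flagged <- which(d2 > cutoff)
```

Rows listed in `flagged` are candidates for a closer look rather than automatic removal; some ordinary rows will typically exceed the cutoff as well.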

Related Solutions

Solved – How to quickly identify participants responding randomly to self-report psychometric tests with many items

Jeromy's own answer is more useful than mine, but since I am working on this myself, I have collected a few handy R code bits that can also be helpful or steer one in the right direction of detecting 'suspicious' responses.

The assumption here is that participants often (I find) respond in a pattern of some sort. Especially so in longer questionnaires.

For example, if participants choose the middle response category from top to bottom of a given scale, you could check how many distinct responses each participant used in the scale (or even the entire survey) by using:

table(apply(data,1, function(X) {length(unique(X))}))

If you are interested in knowing which cases in your data present this pattern (a single response category used throughout), rather than the total per category, you can use this code:

which(apply(data, 1, function(X) length(unique(X)) == 1))

Yet another way of looking at this is by looking at those participants whose responses alternate the response categories in diagonal lines (e.g., 1, 2, 3, 4, 5, 4, 3, 2, 1 or 2, 3, 4, 3, 2, 3, 4, which uses just the center response categories). For these respondents, one would need to calculate the lagged differences between responses, which are all +1 or -1 in such a pattern, with the following code:

apply(data, 1, function(X) all(abs(diff(X)) == 1))
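To see what these checks return, here is a minimal sketch on a made-up matrix. The four respondents and their answers are hypothetical, invented purely for illustration; the diagonal check here tests for consecutive differences of exactly 1 in absolute value, which is what the example sequences 1, 2, 3, 4, 5, 4, 3, 2, 1 and 2, 3, 4, 3, 2, 3, 4 produce.

```r
# Hypothetical toy data: 4 respondents (rows) x 9 items (columns), 1-5 scale
data <- rbind(
  straight = c(3, 3, 3, 3, 3, 3, 3, 3, 3),  # same category throughout
  zigzag   = c(1, 2, 3, 4, 5, 4, 3, 2, 1),  # diagonal over the full range
  narrow   = c(2, 3, 4, 3, 2, 3, 4, 3, 2),  # diagonal over center categories
  varied   = c(1, 4, 2, 5, 3, 3, 1, 5, 2)   # no obvious pattern
)

# Number of distinct response categories used by each respondent
n_unique <- apply(data, 1, function(x) length(unique(x)))

# TRUE for respondents who used a single category for every item
straightliner <- n_unique == 1

# TRUE for respondents whose consecutive answers always differ by exactly 1
diagonal <- apply(data, 1, function(x) all(abs(diff(x)) == 1))
```

Here `straightliner` picks out only the first row and `diagonal` picks out the second and third. Real data with missing values would need `na.rm`-style handling before these checks, which this sketch omits.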

Solved – How to interpret whiskers of a box plot when there are outliers

The question admits confusion and is itself contradictory. If whiskers extend to the extremes, then how can there be plotted data points beyond them?

Let's back up. (Incidentally, the term flier is I think not in general use statistically and I don't try to guess what it might mean, except presumably as a (loose) synonym for outlier.)

There is not a universal definition of box plots. Nor do published box plots always, or even often, come with an explanation of the precise rules used in their construction.

So much for the bad news.

In general, the box of a box plot shows the median and quartiles. A whisker is drawn outwards from each quartile and data points beyond the whiskers are shown individually. Occasionally, box plots are decorated with other summaries or details, such as point symbols showing means or labelling for selected outliers (this big river is "Amazon", or whatever).

Whiskers could extend

1. All the way to the maximum or minimum, in which case no data values are plotted as points beyond.
2. As defined by a convention used by Tukey, J.W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley. The interquartile range IQR is first calculated as IQR = upper quartile $-$ lower quartile. Tukey himself at various points used terms such as hinges and fourths for (approximate) quartiles, but these terms are slowly fading away; most recent treatments have reverted to the earlier term quartiles. There are several conventions for exactly how quartiles are calculated, which I will treat as small print here.
   - One whisker is drawn from the lower quartile down to the lowest data point $\ge$ the lower quartile $-$ 1.5 IQR.
   - The other whisker is drawn from the upper quartile to the highest data point $\le$ the upper quartile $+$ 1.5 IQR.
   - Points beyond either whisker are plotted separately. These are often dubbed outliers, but keep reading for some opinionated discussion.
   - Note that in principle a whisker could be of zero length and not visible as such, as when for example the minimum and the lower quartile are identical, as is quite common with certain kinds of data. Example: counted data including zeros in which at least 25% of the values are zero.
   - Some programs allow you to specify multipliers other than 1.5. Tukey himself before 1977 used multipliers of 1 and/or 2.
3. Other conventions exist, such as drawing whiskers to particular paired percentiles such as those for 1 and 99%, 5 and 95% and 10 and 90%. Usually, data points beyond those percentiles are shown individually as with #2. My personal view is that these conventions are easier to explain than #2.
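The 1.5 IQR convention can be sketched in a few lines of R. The helper name `tukey_whiskers`, the multiplier argument `k`, and the example vector are assumptions made for illustration, and quartiles are computed with R's default `quantile()` rule, which is only one of the several conventions mentioned above (base R's own `boxplot.stats()` uses hinges instead, so its results can differ slightly).

```r
# Illustrative helper (not a library function): whisker ends under the
# "most extreme data point within k * IQR of the nearer quartile" convention
tukey_whiskers <- function(x, k = 1.5) {
  q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # default type 7
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - k * iqr
  upper_fence <- q[2] + k * iqr
  list(
    lower   = min(x[x >= lower_fence]),  # end of the lower whisker
    upper   = max(x[x <= upper_fence]),  # end of the upper whisker
    outside = x[x < lower_fence | x > upper_fence]  # plotted individually
  )
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
w <- tukey_whiskers(x)
```

For this vector the quartiles are 3.25 and 7.75, so the fences sit at -3.5 and 14.5: the whiskers end at the data points 1 and 9, and 30 is shown separately. Note that the whiskers stop at actual data values, not at the fences themselves.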

I would assert that box plots do not show confidence intervals at all, in so far as any display of confidence intervals is usually called something else, or just explained as such.

On outliers, here is one opinion. Others may be found.

Tukey's intention was essentially that the researcher would think about how to handle data points plotted individually, and (for example) that a straggly box plot with outliers might point to analysis on a transformed scale.

It is quite common to see discussions in which lying beyond the nearer quartile $\pm$ 1.5 IQR is treated as a criterion for outliers, and it is not unusual to see inclinations to remove any such data points from further analyses. These attitudes don't match the intention behind that convention. Naturally, tools are often used in ways their creators did not intend.

Making a practical criterion for what is to be displayed separately into a rigid rule for outlier identification is in my view misconceived in principle and, if applied, usually very poor practice. One reason is that some such values are expected in moderate or large samples even from a very well behaved distribution (e.g. a normal or Gaussian). More often, such data points are entirely to be expected in many skewed distributions. Far from being rogue points that should be excluded, such points are key parts of the data. Yet another reason is that the context provided by other variables and/or subject-matter knowledge should inform any decision on outliers, even the decision just to flag or label them as such.
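The point about well behaved and skewed distributions can be made concrete with a small calculation. This sketch uses the population quartiles of a standard normal and of an exponential with rate 1 (the exponential is my choice of skewed example, not one from the discussion above) and computes the probability of a value falling beyond quartile $\pm$ 1.5 IQR; fences computed from a finite sample would vary around these values.

```r
# Standard normal: quartiles at -q and +q, so the two fences are symmetric
q_norm     <- qnorm(0.75)
fence_norm <- q_norm + 1.5 * (2 * q_norm)   # upper quartile + 1.5 * IQR
p_normal   <- 2 * pnorm(-fence_norm)        # both tails

# Exponential with rate 1, a simple right-skewed example
q1 <- qexp(0.25)
q3 <- qexp(0.75)
lower_fence <- q1 - 1.5 * (q3 - q1)         # negative, so nothing falls below it
upper_fence <- q3 + 1.5 * (q3 - q1)
p_exp <- pexp(upper_fence, lower.tail = FALSE)

round(c(normal = p_normal, exponential = p_exp), 4)
```

Under these assumptions about 0.7% of values from a normal and about 4.8% of values from this exponential lie beyond the fences, so a sample of 1000 would be expected to show roughly 7 and 48 such points respectively, none of them errors.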
