Solved – Is it OK to remove outliers from data

outliers

I looked for a way to remove outliers from a dataset and I found this question.

In some of the comments and answers to this question, however, people mentioned that it is bad practice to remove outliers from the data.

In my dataset I have several outliers that are very likely just due to measurement errors. Even if some of them are not, I have no way of checking them case by case, because there are too many data points. Is it statistically valid then to just remove the outliers? Or, if not, what could be another solution?

If I just leave those points there, they influence e.g. the mean in a way that does not reflect reality (because most of them are errors anyway).

EDIT: I am working with skin conductance data. Most of the extreme values are due to artifacts like somebody pulling on the wires.

EDIT2: My main interest in analyzing the data is to determine whether there is a difference between two groups.

Best Answer

One option is to exclude outliers, but IMHO that is something you should only do if you can argue (with near certainty) why such points are invalid (e.g. the measurement equipment broke down, the measurement method was unreliable for some reason, ...). For example, in frequency-domain measurements, the DC component is often discarded because many different terms contribute to it, quite often unrelated to the phenomenon you are trying to observe.

The problem with removing outliers is that, to determine which points are outliers, you need a good model of what is or is not "good data". If you are unsure about the model (which factors should be included, what structure the model has, what the assumptions about the noise are, ...), then you cannot be sure about your outliers. Those outliers might just be samples that are trying to tell you that your model is wrong. In other words: removing outliers will reinforce your (incorrect!) model, instead of allowing you to obtain new insights!

Another option is to use robust statistics. The mean and standard deviation are sensitive to outliers, whereas other measures of "location" and "spread" are more robust. For example, instead of the mean, use the median; instead of the standard deviation, use the interquartile range; instead of ordinary least-squares regression, use robust regression. All of these robust methods de-emphasize the outliers in one way or another, but they typically do not remove the outlier data completely (which is a good thing).
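To make the difference concrete, here is a minimal sketch in Python using only the standard library. The readings are made-up numbers for illustration (not real skin conductance data), and the helper name `summarize` is hypothetical. It compares the mean and standard deviation with the median and interquartile range on the same sample, with and without one artifact-like value.

```python
import statistics


def summarize(values):
    """Return classical and robust summaries of a sample (hypothetical helper)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


# Made-up readings; the last value of `contaminated` mimics an artifact
# such as somebody pulling on the wires.
clean = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.4, 2.2]
contaminated = clean + [25.0]

summary_clean = summarize(clean)
summary_contaminated = summarize(contaminated)

for name in ("mean", "stdev", "median", "iqr"):
    print(f"{name:>6}: clean={summary_clean[name]:.3f}  "
          f"with artifact={summary_contaminated[name]:.3f}")
```

In this toy sample a single extreme value roughly doubles the mean and inflates the standard deviation many times over, while the median and interquartile range barely move. The artifact is still part of the data; it simply carries much less weight in the robust summaries.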
