Solved – Test to identify outliers with R

hypothesis-testing, outliers, r

I would like to identify outliers in a sample of 6,000 observations.

My variable doesn't follow a normal distribution: it is right-skewed, with a long tail on the right. So an extreme value is not necessarily an outlier, and that is my problem.

I have not found any solution to identify outliers.

I saw that there is Dixon's test, but it is restricted to normally distributed samples containing fewer than 30 observations.

Best Answer

A statistical test has to do with replications of the experiment and a null hypothesis that is not "discovered" by the incidental finding of an outlying data point. For that reason, it doesn't make sense to apply a statistical test to individual data points, but you can use critical values or other criteria to flag observations as possible outliers, and then check whether the flagged values are accurate.

Because of Chebyshev's inequality, you can always bound the probability that an observation lies far from the mean in terms of its Z-score: for any distribution with finite variance, the probability that |Z| is at least k is at most 1/k^2. Tukey's well-known rule flags as possible outliers any observations below the lower fence, Q1 - 1.5 IQR, or above the upper fence, Q3 + 1.5 IQR. To give you a sense, in a standard normal distribution the upper fence comes out to about 2.70, which in a sample of 6,000 would flag about 21 observations above it (and about as many below the lower fence) irrespective of their actually being outliers.
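
As a rough illustration of the numbers above, here is a minimal R sketch. The helper name `tukey_flags` and the simulated `rnorm` sample are assumptions made for the example; they are not part of the original question or answer.

```r
# Tukey's fences: flag values outside [Q1 - k * IQR, Q3 + k * IQR].
# `tukey_flags` is a hypothetical helper name chosen for this sketch.
tukey_flags <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Theoretical upper fence for a standard normal distribution.
q3 <- qnorm(0.75)
upper_fence <- q3 + 1.5 * (q3 - qnorm(0.25))   # about 2.70

# Expected number of points above the upper fence in a sample of 6,000.
p_above <- pnorm(upper_fence, lower.tail = FALSE)
expected_above <- 6000 * p_above               # about 21

# Chebyshev's bound on P(|Z| >= upper_fence) holds for any distribution
# with finite variance, so it is much looser than the normal tail area.
chebyshev_bound <- 1 / upper_fence^2

# Simulated check (assumed data): perfectly clean normal data still get flagged.
set.seed(1)
x <- rnorm(6000)
n_flagged <- sum(tukey_flags(x))   # counts both tails
```

Because `n_flagged` counts both tails, it should land near twice `expected_above` for normal data, which is why a flag from this rule is only a prompt to inspect the value.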

Along those lines, it is fair to consider any rule that suits the problem to rank and classify outlying observations. Some ad hoc examples below:

  1. Use Tukey's rule to flag outliers. With 6,000 observations you may set a false discovery rate (FDR) by simulation or something similar, and scale the IQR by a larger multiplier as needed.
  2. Log-transform the data if they are concentrations or counts (as is common for quantities of biological interest).
  3. Use a Box-Cox transform to find the power transformation that makes the data most nearly normal, and then apply normal-theory tests.
  4. Use the Z-score to rank observations and choose a stringent alpha-level critical value to flag outliers.
  5. Assume a known distribution suspected to describe the data-generating process, draw a QQ-plot of the data against it, and rank outliers in terms of squared error from the calibration line.
  6. Use single-observation deletion: refit by maximum likelihood with each observation left out in turn, and find which observation's deletion leads to the greatest improvement in likelihood.
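
A few of these options can be sketched in R as follows. This is only an illustration under stated assumptions: the lognormal sample stands in for the real data, `tukey_flags` is a hypothetical helper, and the multiplier in option 1 is calibrated against a normal reference with a closed-form quantile rather than by the simulation the answer mentions.

```r
# Assumed stand-in for the real data: 6,000 right-skewed (lognormal) values.
set.seed(42)
x <- rlnorm(6000)

# Hypothetical helper: Tukey's fences with an adjustable multiplier k.
tukey_flags <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Option 2: log-transform before flagging. On the raw scale the long right
# tail is flagged wholesale; on the log scale far fewer points are flagged.
n_raw <- sum(tukey_flags(x))
y <- log(x)
n_log <- sum(tukey_flags(y))

# Option 1: widen the fences. Assuming a normal reference on the log scale,
# pick k so that about `target` clean points are flagged among n in total.
n <- length(y)
target <- 1
q3 <- qnorm(0.75)
iqr_norm <- q3 - qnorm(0.25)
k_strict <- (qnorm(1 - target / (2 * n)) - q3) / iqr_norm
n_strict <- sum(tukey_flags(y, k = k_strict))

# Option 4: rank by absolute Z-score and flag beyond a stringent cutoff.
alpha <- 0.001
z <- (y - mean(y)) / sd(y)
z_crit <- qnorm(1 - alpha / 2)
flag_z <- abs(z) > z_crit
ranked <- order(abs(z), decreasing = TRUE)
head(x[ranked])   # the most extreme observations, on the original scale
```

Whatever rule is used, the flagged values are candidates for checking against the source records, not observations to delete automatically.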