Solved – Removing outliers non-normal data

outliers

enter image description here

Hi,

I was wondering if anyone had advice on removing outliers. In a practical experiment relative telomere length in DNA samples was measured in duplicate. Expected values are around 1, and the peak of samples between 0.0-0.5 are fairly certain failed samples.
Since this data isn't normally distributed and the samples are likely fails, I was looking for a valid method to get rid of obvious outliers. Any advice how to do this for this data?
I have tried Median Absolute Deviation already, but the peak remains.

Thanks!

Best Answer

The "failed" samples are not really "outliers", they are clearly a normal part of your experimental process. What you need to do is correctly model this process.

For example, it may be appropriate to assume a mixture of normals distribution. You would let $X_i \sim^{\text{iid}} \text{Bern}(p)$ be 0 if sample $i$ is a "failure", 1 otherwise, and then assume different conditional normal distributions for each outcome:

$$Y_i|(X_i = 0) \sim N(\mu_f,\sigma_f^2)$$ $$Y_i|(X_i = 1) \sim N(\mu_s,\sigma_s^2)$$

You only observe the $Y_i$ but you can estimate the parameter(s) of interest (which seems to be $\mu_s$, the mean of successful samples) using an algorithm like EM.

Related Question