Solved – How to estimate the parameters of a Gaussian distribution sample with outliers

estimation, normal distribution, outliers

I have a sample of length $N$, mostly taken from a Gaussian distribution with unknown mean and variance. Of those $N$ samples, some proportion of them (typically less than 1-2%) are outliers, taken from another Gaussian distribution with a larger mean (larger typically by 3+ standard deviations of the more common distribution). I'm interested in estimating the parameters of the more common "background" distribution so that I can isolate the outliers with some prescribed probability of false alarm.

The sample mean and variance are the obvious estimators for this problem, but they are not sufficiently robust to outliers for my purposes (i.e. the presence of the outliers introduces an additive bias in the estimates). I can estimate the mean more robustly using sample statistics such as the median or mode, which are more resistant to outlier values. However, I'm not sure how to approach the estimation of the distribution variance in a similar manner.

Is there an accepted approach for such a problem?

Edit: I decided to add some more details in response to the commenters below who suggested perhaps fitting a Gaussian mixture model to the data. I'm not sure whether that would be a good approach or not. I want to estimate the parameters of the background distribution in order to select from two hypotheses:

  1. There are no outliers; the sample consists of a Gaussian distribution with unknown parameters.

  2. There are some outliers, located at unknown locations in the sample. While their exact distribution isn't known, I posit that it is reasonable to approximate it as a Gaussian distribution with a significantly larger mean than the background. If this hypothesis is true, I would like to identify the locations of all of those outliers.

So, I guess I was a bit misleading in the original question. I should have said that the sample "may contain some proportion of outliers."

In the case where there are no outliers of interest, I'm not sure that a GMM would give me good results. My goal is to identify the parameters of the underlying Gaussian distribution so that I can identify outliers with a known, controlled type I error probability. I'm going to look for some more information on robust methods for estimating the distribution's scale.

Best Answer

Multivariate Case

Some very applicable research has been done by Rousseeuw et al. See the paper here, which deals with the Minimum Covariance Determinant (MCD) algorithm (the paper is very readable; I would recommend reading it).

The paper deals with finding a subset of the data that minimizes the determinant of the covariance matrix (hence finding the data that is most Normal-looking). It is very fast, and many libraries are available in Python and R.
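As a rough illustration of the library route, here is a minimal sketch using scikit-learn's `MinCovDet` estimator. The simulated data, the 2-D setup, and the false-alarm level of 1e-3 are assumptions made up for this example; the chi-square cutoff on squared Mahalanobis distances is one common way to turn the robust fit into an outlier test, not something prescribed by the paper.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Simulated data (an assumption for illustration): 1000 background points
# plus 20 outliers drawn from a Gaussian with a shifted mean.
rng = np.random.default_rng(0)
background = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=1000)
outliers = rng.multivariate_normal([5.0, 5.0], [[1.0, 0.0], [0.0, 1.0]], size=20)
X = np.vstack([background, outliers])

# Robust location and covariance of the background distribution.
mcd = MinCovDet(random_state=0).fit(X)
robust_mean = mcd.location_
robust_cov = mcd.covariance_

# MinCovDet.mahalanobis returns *squared* Mahalanobis distances.
d2 = mcd.mahalanobis(X)

# Under the Gaussian background model, d2 is approximately chi-square
# with p degrees of freedom, so this cutoff targets the chosen
# false-alarm probability (hypothetical value).
p_false_alarm = 1e-3
cutoff = chi2.ppf(1.0 - p_false_alarm, df=X.shape[1])
flagged = np.flatnonzero(d2 > cutoff)

print("robust mean:", robust_mean)
print("number flagged:", flagged.size)
```

The classical sample mean and covariance could be swapped in for comparison; with contamination of this kind they would be pulled toward the outliers, which is the bias the question describes.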

Univariate Case

Although the above deals with multivariate data, formulas do exist for the MCD (minCovDet) problem in the univariate case. See this question's answer for some details.
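For one-dimensional data the MCD subset can be found exactly: after sorting, it is the run of $h$ consecutive order statistics with the smallest variance. Below is a minimal sketch of that idea using only the standard library. The function names, the 75% support fraction, the simulated sample, and the false-alarm level are assumptions chosen for illustration. The consistency factor rescales the variance of the central subset so that it estimates $\sigma^2$ under a Gaussian model, and the threshold is one-sided because the outliers in the question have a larger mean.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def univariate_mcd(values, support_fraction=0.75):
    """Return (location, scale) from the h consecutive sorted points
    that have the smallest variance (hypothetical helper)."""
    x = sorted(values)
    n = len(x)
    h = min(n, max(2, math.ceil(support_fraction * n)))
    shift = x[n // 2]  # center on the median to limit round-off error
    s1 = sum(v - shift for v in x[:h])          # running sum
    s2 = sum((v - shift) ** 2 for v in x[:h])   # running sum of squares
    best_ss, best_sum = math.inf, 0.0
    for start in range(n - h + 1):
        if start > 0:
            # Slide the window: drop x[start-1], add x[start+h-1].
            out_v = x[start - 1] - shift
            in_v = x[start + h - 1] - shift
            s1 += in_v - out_v
            s2 += in_v ** 2 - out_v ** 2
        ss = s2 - s1 * s1 / h  # sum of squared deviations in the window
        if ss < best_ss:
            best_ss, best_sum = ss, s1
    location = shift + best_sum / h
    raw_var = max(best_ss, 0.0) / h
    # Consistency factor: variance of a standard normal truncated to its
    # central alpha mass is 1 - 2*q*pdf(q)/alpha, with q its upper cut.
    alpha = h / n
    factor = 1.0
    if alpha < 1.0:
        q = STD_NORMAL.inv_cdf((1.0 + alpha) / 2.0)
        factor = 1.0 / (1.0 - 2.0 * q * STD_NORMAL.pdf(q) / alpha)
    return location, math.sqrt(factor * raw_var)

def upper_threshold(location, scale, p_false_alarm):
    """One-sided detection threshold for a Gaussian background."""
    return location + scale * STD_NORMAL.inv_cdf(1.0 - p_false_alarm)

# Simulated sample (an assumption): background N(10, 2) plus about 1.5%
# outliers whose mean is 5 background standard deviations higher.
rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(2000)]
sample += [rng.gauss(20.0, 2.0) for _ in range(30)]

loc, scale = univariate_mcd(sample)
threshold = upper_threshold(loc, scale, p_false_alarm=1e-3)
flagged = [i for i, v in enumerate(sample) if v > threshold]
print(loc, scale, threshold, len(flagged))
```

This sketch omits the reweighting step that full MCD implementations usually apply afterwards to improve efficiency, so the estimates are noisier than they need to be. If no outliers are present, the same procedure still returns estimates of the background parameters, which fits the two-hypothesis setup described in the question.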