Solved – Separating two populations from the sample

dataset, expectation-maximization, outliers

I'm trying to separate two groups of values in a single data set. I can assume that one of the populations is normally distributed and makes up at least half of the sample. The values of the second population lie below or above those of the first (its distribution is unknown). What I'm trying to do is find the lower and upper limits that separate the normally distributed population from the other one.

My assumptions give me a starting point:

  • all points within the interquartile range of the sample are from the normally-distributed population.

I then test the rest of the sample for outliers, taking points from it until they no longer fall within 3 standard deviations of the normally distributed population. This is not ideal, but it seems to produce reasonable enough results.
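As a rough sketch, the procedure described above might look like this in R. The helper name `trim_to_core`, the cutoff `k = 3` and the illustration data are assumptions made for the example, not part of any package. Note that the standard deviation of the IQR core alone underestimates the true spread, so the sketch re-estimates the mean and standard deviation each time points are added.

```r
# Hypothetical helper (not from any package): grow the "normal" core
# outward from the interquartile range until the limits stop moving.
trim_to_core <- function(x, k = 3, max_iter = 100) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  keep <- x >= q[1] & x <= q[2]        # starting core: points inside the IQR
  for (i in seq_len(max_iter)) {
    m <- mean(x[keep])
    s <- sd(x[keep])
    new_keep <- abs(x - m) <= k * s    # points within k SDs of the current core
    if (identical(new_keep, keep)) break
    keep <- new_keep
  }
  list(lower = m - k * s, upper = m + k * s, keep = keep)
}

# Made-up illustration data: 200 normal-looking values plus 40 high values.
x <- c(qnorm(ppoints(200), mean = 50, sd = 5), seq(80, 100, length.out = 40))
res <- trim_to_core(x)
c(res$lower, res$upper)   # estimated limits
sum(res$keep)             # number of points kept as "normal"
```

The fixed 3 SD cutoff is the choice stated in the question; it trims a small fraction of genuinely normal points, so the limits are approximate.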

Is my assumption statistically sound? What would be a better way to go about this?

p.s. please fix the tags someone.

Best Answer

If I understand correctly, you can just fit a mixture of two Normals to the data. Many R packages can do this. This example uses the mixtools package:

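# Taken from the mixtools documentation
library(mixtools)
data(faithful)

# Fit a mixture of two Normals to the waiting times
# (lambda = 0.5 is the starting value for the mixing proportions)
wait1 <- normalmixEM(faithful$waiting, lambda = 0.5)

# Plot the histogram with the fitted component densities,
# skipping the log-likelihood plot
plot(wait1, density = TRUE, loglik = FALSE)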

This gives:

[Image: mixture of two Normals] http://img294.imageshack.us/img294/4213/kernal.jpg

The package also contains more sophisticated methods - check the documentation.
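To show how a fitted mixture can be turned into limits, here is a minimal base-R sketch of EM for a two-component normal mixture. It is illustrative only and is not the mixtools implementation; the function name `fit_two_normals`, the median-split starting values and the 0.5 posterior cutoff are assumptions. It also assumes the second group is roughly normal, which the question does not guarantee.

```r
# Base-R sketch of EM for a two-component normal mixture.
# Illustrative only; this is not the mixtools implementation.
fit_two_normals <- function(x, max_iter = 500, tol = 1e-8) {
  # Crude starting values (an assumption): split the data at the median.
  lo <- x[x <= median(x)]
  hi <- x[x > median(x)]
  lambda <- c(0.5, 0.5)
  mu <- c(mean(lo), mean(hi))
  sigma <- c(sd(lo), sd(hi))
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each point belongs to each component
    d <- cbind(lambda[1] * dnorm(x, mu[1], sigma[1]),
               lambda[2] * dnorm(x, mu[2], sigma[2]))
    post <- d / rowSums(d)
    # M-step: weighted updates of proportions, means and standard deviations
    n_k <- colSums(post)
    lambda <- n_k / length(x)
    mu <- colSums(post * x) / n_k
    sigma <- sqrt(colSums(post * outer(x, mu, "-")^2) / n_k)
    # Stop once the log-likelihood has stopped improving
    loglik <- sum(log(rowSums(d)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(lambda = lambda, mu = mu, sigma = sigma, posterior = post)
}

x <- faithful$waiting
fit <- fit_two_normals(x)

# The question assumes the normal group is the larger one.
main <- which.max(fit$lambda)
in_main <- fit$posterior[, main] > 0.5

# Observed limits of the points assigned to the main component
range(x[in_main])
```

With mixtools, the corresponding quantities are available from the fitted object as `wait1$lambda`, `wait1$mu`, `wait1$sigma` and `wait1$posterior`, so the last few lines can be applied to that fit directly.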
