Comparing outliers in two distributions

normal distributionoutliers

I apologize in advance as I am not well-versed in statistics, but I hope that this question makes sense.

I have 2 populations which are normally distributed and have a near-identical mean. I would like to know if the upper quantiles (say >0.99) differ – ie, does one population extend further than the other, and how can I see if the difference is significantly different?

One approach I have thought to take is to subset my dataset on just the upper quantiles, then plot the ECDF, but I don't know that this is a correct approach.

Best Answer

Two samples of size 1000 from the same normal population.

set.seed(2022)
x1 = rnorm(1000, 50, 7)
x2 = rnorm(1000, 50, 7)
q1 = quantile(x1,.95); q1
     95% 
61.17923 
q2 = quantile(x2,.95); q2
     95% 
61.98871 
dq = abs(diff(c(q1,q2)));  dq
      95% 
0.8094812

Is this an unusually large discrepancy?

set.seed(106)
m = 10^5;  dq.95 = numeric(m)
for (i in 1:m) {
 x1 = rnorm(1000, 50, 7)
 x2 = rnorm(1000, 50, 7)
 q1 =quantile(x1,.95)
 q2 =quantile(x2,.95)
 dq.95[i] = abs( diff(c(q1,q2) ))
 }
mean(dq.95 >= dq)
[1] 0.22008

No. Not unusually large; about 22% of such comparisons have 95th percentiles farther apart.

hist(dq.95, prob=T, col = "skyblue2")
abline(v = dq, col="red")

Note: As you might guess from the histogram, the distribution of the 95th percentile of a sufficiently large sample is approximately normal. The variance gets larger for percentiles in the far tails. This CLT for quantiles (except the min and max) a fundamental result in the theory of order statistics. Depending on the circumstances of your project, it might be worth your while to see if your samples are large enough to use this asymptotic result.

Related Solutions

Solved – Separating two populations from the sample

If I understand correctly, then you can just fit a mixture of two Normals to the data. There are lots of R packages that are available to do this. This example uses the mixtools package:

#Taken from the documentation
library(mixtools)
data(faithful)
attach(faithful)

#Fit two Normals
wait1 = normalmixEM(waiting, lambda = 0.5)
plot(wait1, density=TRUE, loglik=FALSE)

This gives:

Mixture of two Normals http://img294.imageshack.us/img294/4213/kernal.jpg

The package also contains more sophisticated methods - check the documentation.

Solved – Quantiles from the combination of normal distributions

Unfortunately, the standard normal (from which all others can be determined, since the normal is a location-scale family) quantile function does not admit a closed form (i.e. a 'pretty formula'). The closest thing to a closed form is that the standard normal quantile function is the function, $w$, that satisfies the differential equation

$$ \frac{d^2 w}{d p^2} = w \left(\frac{d w}{d p}\right)^2 $$

and the initial conditions $w(1/2) = 0$ and $w'(1/2) = \sqrt{2 \pi}$. In most computing environments there is a function that numerically calculates the normal quantile function. In R, you would type

qnorm(p, mean=mu, sd=sigma)

to get the $p$'th quantile of the $N(\mu, \sigma^2)$ distribution.

Edit: With a modified understanding of the problem, the data is generated from a mixture of normals, so that the density of the observed data is:

$$ p(x) = \sum_{i} w_{i} p_{i}(x) $$

where $\sum_{i} w_{i} = 1$ and each $p_{i}(x)$ is some normal density with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. It follows that the CDF of the observed data is

$$ F(y) = \int_{-\infty}^{y} \sum_{i} w_{i} p_{i}(x) dx = \sum_{i} w_{i} \int_{-\infty}^{y} p_{i}(x) = \sum_{i} w_{i} F_{i}(y) $$

where $F_{i}(x)$ is the normal CDF with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. Integration and summation can be interchanged because these integrals are finite. This CDF is continuous and easy enough to calculate on a computer, so the inverse CDF, $F^{-1}$, also known as the quantile function, can be calculated by doing a line search. I default to this option because no simple formula for the quantile function of a mixture of normals, as a function of the quantiles of the constituent distributions, comes to mind.

The following R code numerically calculates $F^{-1}$ using bisection for the line search. The function F_inv() is the quantile function, you need to supply the vector containing each $w_{i}, \mu_{i}, \sigma_{i}$ and the quantile to be solved for, $p$.

# evaluate the function at the point x, where the components 
# of the mixture have weights w, means stored in u, and std deviations
# stored in s - all must have the same length.
F = function(x,w,u,s) sum( w*pnorm(x,mean=u,sd=s) )

# provide an initial bracket for the quantile. default is c(-1000,1000). 
F_inv = function(p,w,u,s,br=c(-1000,1000))
{
   G = function(x) F(x,w,u,s) - p
   return( uniroot(G,br)$root ) 
}

#test 
# data is 50% N(0,1), 25% N(2,1), 20% N(5,1), 5% N(10,1)
X = c(rnorm(5000), rnorm(2500,mean=2,sd=1),rnorm(2000,mean=5,sd=1),rnorm(500,mean=10,sd=1))
quantile(X,.95)
    95% 
7.69205 
F_inv(.95,c(.5,.25,.2,.05),c(0,2,5,10),c(1,1,1,1))
[1] 7.745526

# data is 20% N(-5,1), 45% N(5,1), 30% N(10,1), 5% N(15,1)
X = c(rnorm(5000,mean=-5,sd=1), rnorm(2500,mean=5,sd=1),
      rnorm(2000,mean=10,sd=1), rnorm(500, mean=15,sd=1))
quantile(X,.95)
     95% 
12.69563 
F_inv(.95,c(.2,.45,.3,.05),c(-5,5,10,15),c(1,1,1,1))
[1] 12.81730

Best Answer

Related Solutions

Solved – Separating two populations from the sample

Solved – Quantiles from the combination of normal distributions

Related Question