Comparing outliers in two distributions

Tags: normal distribution, outliers

I apologize in advance as I am not well-versed in statistics, but I hope that this question makes sense.

I have two populations that are normally distributed and have near-identical means. I would like to know whether their upper quantiles (say, above 0.99) differ, i.e., does one population extend further than the other, and how can I test whether the difference is statistically significant?

One approach I have thought of is to subset my dataset to just the upper quantiles and then plot the ECDF, but I don't know whether this is a correct approach.

Best Answer

Consider two samples of size 1000 from the same normal population, and compare their 95th percentiles.

set.seed(2022)
x1 = rnorm(1000, 50, 7)
x2 = rnorm(1000, 50, 7)
q1 = quantile(x1,.95); q1
     95% 
61.17923 
q2 = quantile(x2,.95); q2
     95% 
61.98871 
dq = abs(diff(c(q1,q2)));  dq
      95% 
0.8094812 

Is this an unusually large discrepancy? Simulating many such pairs of samples gives a reference distribution for the difference.

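set.seed(106)
m = 10^5;  dq.95 = numeric(m)      # number of simulated pairs; storage for the gaps
for (i in 1:m) {
  x1 = rnorm(1000, 50, 7)
  x2 = rnorm(1000, 50, 7)
  q1 = quantile(x1, .95)
  q2 = quantile(x2, .95)
  dq.95[i] = abs(diff(c(q1, q2)))  # absolute gap between the two 95th percentiles
}
mean(dq.95 >= dq)                  # proportion of simulated gaps at least as large as observed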
[1] 0.22008

No, it is not unusually large: in about 22% of such comparisons the 95th percentiles are at least as far apart.

hist(dq.95, prob=TRUE, col = "skyblue2")
abline(v = dq, col="red")
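The simulation above relies on knowing the population (Normal with mean 50 and SD 7). With real data that is usually unknown, so one possible alternative, swapped in here and not part of the original answer, is a permutation test: pool the two samples, reshuffle the group labels many times, and see how often the reshuffled gap in the 95th percentile is at least as large as the observed one. The sketch below assumes the two samples are exchangeable under the null hypothesis; the function name `perm_quantile_test` and its arguments are hypothetical, introduced only for illustration.

```r
# Permutation test for a difference in an upper quantile between two samples.
# Assumption: under the null hypothesis both samples come from the same
# distribution, so the group labels can be reshuffled freely.
perm_quantile_test <- function(a, b, p = 0.95, B = 2000) {
  obs    <- abs(quantile(a, p) - quantile(b, p))   # observed gap
  pooled <- c(a, b)
  n_a    <- length(a)
  gaps   <- replicate(B, {
    idx <- sample(length(pooled), n_a)             # random relabelling
    abs(quantile(pooled[idx], p) - quantile(pooled[-idx], p))
  })
  # Add-one correction keeps the estimated p-value strictly positive.
  list(observed = unname(obs),
       p.value  = (sum(gaps >= obs) + 1) / (B + 1))
}

set.seed(2022)
a <- rnorm(1000, 50, 7)   # illustrative data, generated like x1 above
b <- rnorm(1000, 50, 7)   # illustrative data, generated like x2 above
res <- perm_quantile_test(a, b, p = 0.95)
res$observed              # observed gap between the two 95th percentiles
res$p.value               # permutation p-value
```

Because the samples are regenerated with the same seed and calls as in the first block, the observed gap should coincide with dq. For the 0.99 quantile mentioned in the question, set p = 0.99, keeping in mind that quantiles further out in the tail are estimated less precisely.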


Note: As the shape of the histogram suggests (the absolute difference of two independent, approximately normal quantities is approximately half-normal), the 95th percentile of a sufficiently large sample is approximately normally distributed. The variance gets larger for percentiles in the far tails. This CLT for quantiles (which does not cover the min and max) is a fundamental result in the theory of order statistics. Depending on the circumstances of your project, it might be worth your while to see if your samples are large enough to use this asymptotic result.
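As a rough check of the asymptotic result, here is a minimal sketch using the standard large-sample approximation for a sample quantile, SE ≈ sqrt(p(1 - p)/n) / f(q_p), where f is the population density at the true quantile q_p. It assumes the same Normal(50, 7) population and sample size 1000 as the simulation above; the variable names are illustrative only.

```r
# Asymptotic standard error of the sample p-th quantile:
#   SE ~ sqrt(p * (1 - p) / n) / f(q_p)
# Assumption: the population is Normal(mean = 50, sd = 7), as in the simulation.
p <- 0.95; n <- 1000; mu <- 50; sigma <- 7

q_p     <- qnorm(p, mu, sigma)                        # true 95th percentile
se_q    <- sqrt(p * (1 - p) / n) / dnorm(q_p, mu, sigma)
se_diff <- sqrt(2) * se_q                             # two independent samples, equal n

dq_obs  <- 0.8094812                                  # observed gap reported above
p_asym  <- 2 * pnorm(-dq_obs / se_diff)               # two-sided normal tail probability

c(se_q = se_q, se_diff = se_diff, p_asym = p_asym)
```

Under these assumptions the standard error of a single 95th percentile comes out near 0.47, and the two-sided tail probability is about 0.22, in line with the simulated proportion. With real data the density at the quantile is unknown and would have to be estimated, which is one reason a resampling approach may be easier in practice.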