Solved – Can the mean lie outside the the interquartile range (Q1 and Q3)? and if so what does it do to the distribution

interquartilemean

Can the mean lie outside of the interquartile range? I realize that extreme outliers can affect or pull the mean, but can it pull the mean outside of the interval from the first quartile to the third quartile?

Best Answer

If "mean" refers to a statistic for a batch of data, then consider the dataset $(1,2,3,4,10^6)$ whose quartiles must lie between $1$ and $4$ (depending on how you compute them) but whose mean is $200,002$.

If instead it refers to a property of a distribution, then assign a probability of $1/5$ to each of the five numbers in the previous batch to create a (discrete) distribution. The same calculations apply, leading to the same conclusions.

The point is that quartiles are resistant to changes in the data, whereas the mean is sensitive to changes in even any one data value. When we add $\epsilon$ to any single value in a dataset of $n\gt 4$ numbers, the mean changes by $\epsilon/n$--which may be arbitrarily large--but the quartiles (if they change at all) only shift to the neighboring values in the original dataset and therefore are limited in how much they can change. The preceding example exploited this in an extreme way.

Influence functions study how such changes in data values create changes in statistical summaries of those values.

Related Solutions

Solved – Determine outliers using IQR or standard deviation

It seems like you have so many outliers that only looking at the residuals after fitting with, say, ordinary least squares might be misleading---some sample points could have so high influence (or "leverage") that the fit is changed, maybe misleading, just to make the residuals corresponding to those high leverage points small. So you probably needs robust fitting methods!

You can start by looking at the answers to How to optimize a regression by removing 10% "worst" data points? and maybe search this site for the robust tag: https://stats.stackexchange.com/questions/tagged/robust

Solved – How to calculate the 4th quartile from median and IQR

Note: In the following answer I assume that you only know the quantiles you mentioned and you do not know anything else about the distribution, for instance you do not know whether the distribution is symmetric or what its pdf or its (centralized) moments are.

It is not possible to calculate the 4th quartile, if you have only the median and the IQR.

Let us look at the following definitions:

median = second quartile.

IQR = third quartile $-$ first quartile.

The 4th quartile is in neither of these two equations. Therefore, it is impossible to calculate it with the information given.

Here is one example:

   x <- c(1,2,3,4,5,6,7,8,9,10)
   y <- c(1,2,3,4,5,6,7,8,9,20)

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 

summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    6.50    7.75   20.00

The first quartile is for both "x" and "y" 3,25. Also the median is 5.5 for both. The third quartile is 7.75 for both and the IQR is 7.75 $-$ 3.25 = 4.5 for both. However, the 4th quartile, which is also the maximum, is different, namely 10 and 20.

You can also look at boxplots of x and y and you will see that the first quartile, the second quartile (median) and the third quartile are equal. Therefore, you cannot conclude anything about the rest of the distribution of the datapoints.

df <- data.frame(x,y)
p <- ggplot(stack(df), aes(x = ind, y = values)) + geom_boxplot()
p

Best Answer

Related Solutions

Solved – Determine outliers using IQR or standard deviation

Solved – How to calculate the 4th quartile from median and IQR

Related Question