Solved – Confidence interval for quantiles: distribution-free, asymptotic and assuming a normal distribution

Tags: asymptotics, confidence interval, exact-test, quantiles

An acquaintance of mine has been using this wrong inference formula for years: given

an i.i.d. sample $\mathbf{X}=\{X_1,\dots,X_N\}$ of a continuous RV $X$,
the sample mean $\bar{X}=\frac{\sum X_i}{N}$ and the sample standard deviation $\bar{\sigma}=\sqrt{\frac{\sum \left(X_i-\bar{X}\right)^2}{N-1}}$,

estimate the 0.95-quantile $q_{0.95}$ as

$$q_{0.95} = \bar{X} + 2 \bar{\sigma}$$

(which is not even a decent point estimate – you should at the very least use $q_{0.95} = \bar{X} + 1.645\bar{\sigma}$).

What are the correct confidence intervals for a generic $q$, in the three cases:

1. We know nothing about $X$ (apart from the fact that it's continuous), and we look for an exact (non-asymptotic) answer. I think this answer for the median could be modified for a generic quantile.
2. As before, but an asymptotic solution is fine. I guess there should be at least a couple of answers here: one for quantiles which aren't close to 0 or 1, and one for quantiles which are. Maybe one based on a normal approximation and one based on Poisson?
3. Finally, we assume $X$ to have a Gaussian distribution with unknown mean and variance.

Best Answer

The first case was answered in detail in this question.
One example of the second case is shown here, where the authors apply a normal approximation to the binomial distribution used in calculations of the first case.
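Both of these cases rest on the same fact: the number of observations falling below the true quantile $x_q$ is Binomial$(n, q)$, so a pair of order statistics brackets $x_q$ with a coverage that can be computed exactly. The following is only a minimal sketch of that idea, not the code from the linked answers; the function names `quantCI_exact` and `quantCI_approx` and the equal-tailed choice of ranks are my own assumptions.

```r
# Distribution-free CI for the q-quantile from order statistics.
# With B ~ Binomial(n, q) counting the observations below x_q,
# P(X_(l) < x_q < X_(u)) = P(l <= B <= u - 1).
quantCI_exact <- function(x, q = 0.5, conf_level = 0.95) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  alpha <- 1 - conf_level
  l <- qbinom(alpha / 2, n, q)          # P(B <= l - 1) < alpha/2
  u <- qbinom(1 - alpha / 2, n, q) + 1  # P(B <= u - 1) >= 1 - alpha/2
  list(
    # A rank outside 1..n means that side of the interval is unbounded
    ci = c(if (l >= 1) x[l] else -Inf, if (u <= n) x[u] else Inf),
    ranks = c(l, u),
    coverage = pbinom(u - 1, n, q) - pbinom(l - 1, n, q)
  )
}

# Asymptotic variant: normal approximation to the binomial ranks
# (one of several possible rounding conventions).
quantCI_approx <- function(x, q = 0.5, conf_level = 0.95) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  half <- qnorm((1 + conf_level) / 2) * sqrt(n * q * (1 - q))
  l <- max(1, floor(n * q - half))
  u <- min(n, ceiling(n * q + half) + 1)
  list(
    ci = c(x[l], x[u]),
    ranks = c(l, u),
    coverage = pbinom(u - 1, n, q) - pbinom(l - 1, n, q)  # exact coverage of these ranks
  )
}

set.seed(1)
x <- rnorm(50, 100, 15)  # hypothetical data, for illustration only
res_exact <- quantCI_exact(x, q = 0.25)
res_approx <- quantCI_approx(x, q = 0.25)
res_exact
res_approx
```

The `coverage` element reports the exact binomial coverage of the chosen ranks, which makes it easy to see how far the normal approximation drifts from the nominal level. For $q$ close to 0 or 1 that approximation is likely to be poor, which is presumably why the question suggests a Poisson-based alternative.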

The third case is given by Hahn and Meeker in their handbook Statistical Intervals (2nd ed., Wiley 2017):

A two-sided $100(1-\alpha)\%$ confidence interval for $x_q$, the $q$ quantile of the normal distribution, is

$$ \left[\bar{x}-t_{(1-\alpha/2;\,n-1,\,\delta)}\frac{s}{\sqrt{n}},\;\bar{x}-t_{(\alpha/2;\,n-1,\,\delta)}\frac{s}{\sqrt{n}}\right] $$

where $t_{(\gamma;\,n-1,\,\delta)}$ is the $\gamma$ quantile of a noncentral $t$-distribution with $n-1$ degrees of freedom and noncentrality parameter $\delta = -\sqrt{n}z_{(q)}=\sqrt{n}z_{(1-q)}$.

Here, $z_{(q)}$ denotes $\Phi^{-1}(q)$, the $q$ quantile of the standard normal distribution.
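
To see where the noncentral $t$ comes from, here is a short derivation sketch (my own reconstruction, not quoted from the handbook). Writing the normal quantile as $x_q = \mu + z_{(q)}\sigma$,

$$T=\frac{\sqrt{n}\,(\bar{x}-x_q)}{s}=\frac{\sqrt{n}\,(\bar{x}-\mu)/\sigma-\sqrt{n}\,z_{(q)}}{s/\sigma}\sim t_{(n-1,\,\delta)},\qquad \delta=-\sqrt{n}\,z_{(q)},$$

because the numerator is $N(\delta, 1)$ and is independent of $(n-1)s^2/\sigma^2\sim\chi^2_{n-1}$. Solving

$$P\left(t_{(\alpha/2;\,n-1,\,\delta)}\le T\le t_{(1-\alpha/2;\,n-1,\,\delta)}\right)=1-\alpha$$

for $x_q$ gives exactly the two endpoints $\bar{x}-t_{(1-\alpha/2;\,n-1,\,\delta)}\,s/\sqrt{n}$ and $\bar{x}-t_{(\alpha/2;\,n-1,\,\delta)}\,s/\sqrt{n}$.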

For example, suppose we drew a sample of $n=20$ observations from a normal distribution with unknown mean and standard deviation. The sample mean was $\bar{x}=10.5$ and the sample standard deviation was $s=3.19$. Then the two-sided $95\%$ confidence interval for the $q=0.25$ quantile $x_{0.25}$ is $(6.42; 9.76)$.
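
Since this example only states summary statistics, it may help to evaluate the formula directly from $\bar{x}$, $s$ and $n$. The following is a minimal sketch; the helper name `normquantCI_summary` is hypothetical and not part of the original answer.

```r
# CI for the q-quantile of a normal population from summary statistics,
# using quantiles of the noncentral t-distribution.
normquantCI_summary <- function(xbar, s, n, conf_level = 0.95, q = 0.5) {
  ncp <- -sqrt(n) * qnorm(q)  # noncentrality parameter delta
  tval <- qt(c((1 + conf_level) / 2, (1 - conf_level) / 2), df = n - 1, ncp = ncp)
  xbar - tval * s / sqrt(n)   # (lower, upper)
}

ci <- normquantCI_summary(xbar = 10.5, s = 3.19, n = 20, q = 0.25)
point_est <- 10.5 + qnorm(0.25) * 3.19  # plug-in estimate of x_0.25
round(ci, 2)
point_est
```

Note that the interval is not symmetric around the plug-in estimate $\bar{x} + z_{(q)}s$, because the noncentral $t$-distribution is skewed.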

Here is some R code and a small simulation to check the coverage. By changing the parameters, you can run your own simulations:

normquantCI <- function(x, conf_level = 0.95, q = 0.5) {
  
  x <- na.omit(x)
  n <- length(x)
  xbar <- mean(x)
  s <- sd(x)
  ncp <- -sqrt(n)*qnorm(q)
  tval <- qt(c((1 + conf_level)/2, (1 - conf_level)/2), n - 1, ncp)
  se <- s/sqrt(n)
  
  xbar - tval*se
  
}

# Simulate the coverage

set.seed(142857)

q <- 0.25 # Quantile to calculate the CI for
conf_level <- 0.95 # Confidence level
true_mean <- 100 # The true mean of the normal distribution
true_sd <- 15 # True sd of the normal distribution
sampsi <- 20 # The sample size

trueq <- qnorm(q, true_mean, true_sd) # The true quantile

res <- replicate(1e5, {
  citmp <- normquantCI(rnorm(sampsi, true_mean, true_sd), conf_level = conf_level, q = q)
  ifelse(citmp[1] < trueq & citmp[2] > trueq, 1, 0)
})

sum(res)/length(res)
[1] 0.95043

Related Solutions

Solved – Asymptotic distribution of sample variance of non-normal sample

To side-step dependencies arising when we consider the sample variance, we write

$$(n-1)s^2 = \sum_{i=1}^n\Big((X_i-\mu) -(\bar x-\mu)\Big)^2$$

$$=\sum_{i=1}^n\Big(X_i-\mu\Big)^2-2\sum_{i=1}^n\Big((X_i-\mu)(\bar x-\mu)\Big)+\sum_{i=1}^n\Big(\bar x-\mu\Big)^2$$

and after a little manipulation,

$$=\sum_{i=1}^n\Big(X_i-\mu\Big)^2 - n\Big(\bar x-\mu\Big)^2$$

Therefore

$$\sqrt n(s^2 - \sigma^2) = \frac {\sqrt n}{n-1}\sum_{i=1}^n\Big(X_i-\mu\Big)^2 -\sqrt n \sigma^2- \frac {\sqrt n}{n-1}n\Big(\bar x-\mu\Big)^2 $$

Manipulating,

$$\sqrt n(s^2 - \sigma^2) = \frac {\sqrt n}{n-1}\sum_{i=1}^n\Big(X_i-\mu\Big)^2 -\sqrt n \frac {n-1}{n-1}\sigma^2- \frac {n}{n-1}\sqrt n\Big(\bar x-\mu\Big)^2 $$

$$=\frac {n\sqrt n}{n-1}\frac 1n\sum_{i=1}^n\Big(X_i-\mu\Big)^2 -\sqrt n \frac {n-1}{n-1}\sigma^2- \frac {n}{n-1}\sqrt n\Big(\bar x-\mu\Big)^2$$

$$=\frac {n}{n-1}\left[\sqrt n\left(\frac 1n\sum_{i=1}^n\Big(X_i-\mu\Big)^2 -\sigma^2\right)\right] + \frac {\sqrt n}{n-1}\sigma^2 -\frac {n}{n-1}\sqrt n\Big(\bar x-\mu\Big)^2$$

The term $n/(n-1)$ becomes unity asymptotically. The term $\frac {\sqrt n}{n-1}\sigma^2$ is deterministic and goes to zero as $n \rightarrow \infty$.

We also have $\sqrt n\Big(\bar x-\mu\Big)^2 = \left[\sqrt n\Big(\bar x-\mu\Big)\right]\cdot \Big(\bar x-\mu\Big)$. The first component converges in distribution to a Normal, the second converges in probability to zero. Then by Slutsky's theorem the product converges in probability to zero,

$$\sqrt n\Big(\bar x-\mu\Big)^2\xrightarrow{p} 0$$

We are left with the term

$$\left[\sqrt n\left(\frac 1n\sum_{i=1}^n\Big(X_i-\mu\Big)^2 -\sigma^2\right)\right]$$

Alerted by a lethal example offered by @whuber in a comment to this answer, we want to make certain that $(X_i-\mu)^2$ is not constant. Whuber pointed out that if $X_i$ is a Bernoulli $(1/2)$ then this quantity is a constant. So excluding variables for which this happens (perhaps other dichotomous, not just $0/1$ binary?), for the rest we have

$$\mathrm{E}\Big(X_i-\mu\Big)^2 = \sigma^2,\;\; \operatorname {Var}\left[\Big(X_i-\mu\Big)^2\right] = \mu_4 - \sigma^4$$

and so the term under investigation is a usual subject matter of the classical Central Limit Theorem, and

$$\sqrt n(s^2 - \sigma^2) \xrightarrow{d} N\left(0,\mu_4 - \sigma^4\right)$$

Note: the above result of course also holds for normally distributed samples, but in that case a finite-sample chi-square distributional result is also available.
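
The limiting variance $\mu_4 - \sigma^4$ can be checked by simulation. The sketch below is my own illustration, assuming an Exponential(rate = 1) population, for which $\sigma^2 = 1$ and $\mu_4 = 9$; the sample size and number of replications are arbitrary choices.

```r
set.seed(2024)
n <- 2000     # sample size per replication
reps <- 2000  # number of replications

# Exponential(rate = 1): sigma^2 = 1 and mu_4 = 9, so mu_4 - sigma^4 = 8
stat <- replicate(reps, sqrt(n) * (var(rexp(n, rate = 1)) - 1))
v_hat <- var(stat)
c(simulated = v_hat, theoretical = 9 - 1^2)

# Bernoulli(1/2): (X_i - mu)^2 equals 1/4 for both outcomes,
# so mu_4 - sigma^4 = 0 and the normal limit degenerates.
mu4_bern <- mean((c(0, 1) - 0.5)^4)
sig2_bern <- mean((c(0, 1) - 0.5)^2)
degenerate_var <- mu4_bern - sig2_bern^2
degenerate_var
```

The simulated variance should land near 8, up to Monte Carlo error. The Bernoulli$(1/2)$ computation shows why that case has to be excluded.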

Solved – Confidence interval for the mean – Normal distribution or Student’s t-distribution

1. Normal data, variance known: If you have observations $X_1, X_2, \dots, X_n$ sampled at random from a normal population with unknown mean $\mu$ and known standard deviation $\sigma,$ then a 95% confidence interval (CI) for $\mu$ is $\bar X \pm 1.96 \sigma/\sqrt{n}.$ This is the only situation in which the z interval is exactly correct.

2. Nonnormal data, variance known: If the population distribution is not normal and the sample is 'large enough', then $\bar X$ is approximately normal and the same formula provides an approximate 95% CI. The rule that $n \ge 30$ is 'large enough' is unreliable here. If the population distribution is heavy-tailed, then $\bar X$ may not have a distribution that is close to normal (even if $n \ge 30).$ The 'Central Limit Theorem' often provides reasonable approximations for moderate values of $n,$ but it is a limit theorem, with guaranteed results only as $n \rightarrow \infty.$

3. Normal data, variance unknown: If you have observations $X_1, X_2, \dots, X_n$ sampled at random from a normal population with unknown mean $\mu$ and unknown standard deviation $\sigma,$ with $\mu$ estimated by the sample mean $\bar X$ and $\sigma$ estimated by the sample standard deviation $S,$ then a 95% confidence interval (CI) for $\mu$ is $\bar X \pm t^* S/\sqrt{n},$ where $t^*$ cuts probability $0.025$ from the upper tail of Student's t distribution with $n - 1$ degrees of freedom. This is the only situation in which the t interval is exactly correct.

Examples: If $n=10$, then $t^* = 2.262$ and if $n = 30,$ then $t^* = 2.045.$ (Computations from R below; you could also use a printed 't table'.)

qt(.975, 9);  qt(.975, 29)
[1] 2.262157  # for n = 10
[1] 2.04523   # for n = 30

Notice that 2.045 and 1.96 (from Part 1 above) both round to 2.0. If $n \ge 30$ then $t^*$ rounds to 2.0. That is the basis for the 'rule of 30', often mindlessly parroted in other contexts where it is not relevant.

There is no similar coincidental rounding for CIs with confidence levels other than 95%. For example, in Part 1 above a 99% CI for $\mu$ is obtained as $\bar X \pm 2.58 \sigma/\sqrt{n}.$ However, $t^*=2.76$ for $n = 30$ and $t^* = 2.65$ for $n = 70.$

qnorm(.995)
[1] 2.575829
qt(.995, 29)
[1] 2.756386
qt(.995, 69)
[1] 2.648977

4. Nonnormal data, variance unknown: Confidence intervals based on the t distribution (as in Part 3 above) are known to be 'robust' against moderate departures from normality. (If $n$ is very small, there should be no far outliers or evidence of severe skewness.) Then, to a degree that is difficult to predict, a t CI may provide a useful CI for $\mu.$ By contrast, if the type of distribution is known, it may be possible to find an exact form of CI.

For example, if $n = 30$ observations from a (distinctly nonnormal) exponential distribution with unknown mean $\mu$ have $\bar X = 17.24,\, S = 15.33,$ then the (approximate) 95% t CI is $(11.33, 23.15).$

t.test(x)

        One Sample t-test

data:  x
t = 5.9654, df = 29, p-value = 1.752e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 11.32947 23.15118
sample estimates:
mean of x 
 17.24033

However, $$\frac{\bar X}{\mu} \sim \mathsf{Gamma}(\text{shape}=n,\text{rate}=n),$$ so that, with $L$ and $U$ the 0.025 and 0.975 quantiles of this gamma distribution, $$P(L \le \bar X/\mu < U) = P(\bar X/U < \mu \le \bar X/L)=0.95$$ and an exact 95% CI for $\mu$ is $(\bar X/U,\, \bar X/L) = (12.42, 25.55).$

qgamma(c(.025,.975), 30, 30)
[1] 0.6746958 1.3882946
mean(x)/qgamma(c(.975,.025), 30, 30)
[1] 12.41835 25.55274
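
To get a feel for how the two intervals behave for exponential data, one can compare their coverage by simulation. This is a rough sketch of my own; the true mean `mu = 17` and the number of replications are arbitrary assumptions, not values from the example above.

```r
set.seed(101)
n <- 30      # sample size, as in the example
mu <- 17     # hypothetical true mean of the exponential population
reps <- 1e4  # number of simulated samples

# Gamma quantiles for the exact interval (upper quantile first gives the lower limit)
gq <- qgamma(c(0.975, 0.025), shape = n, rate = n)

covered <- replicate(reps, {
  x <- rexp(n, rate = 1 / mu)
  t_ci <- t.test(x)$conf.int  # approximate t interval
  g_ci <- mean(x) / gq        # exact gamma-based interval
  c(t = t_ci[1] < mu && mu < t_ci[2],
    gamma = g_ci[1] < mu && mu < g_ci[2])
})
coverage <- rowMeans(covered)
coverage
```

The gamma-based interval should cover at close to the nominal 95%, while the t interval typically falls somewhat short for data this skewed.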

Addendum on bootstrap CI: If data seem non-normal, but the actual population distribution is unknown, then a 95% nonparametric bootstrap CI may be the best choice. Suppose we have $n=20$ observations from an unknown distribution, with $\bar X = 13.54$ and values shown in the stripchart below.

The observations seem distinctly right-skewed and fail a Shapiro-Wilk normality test with P-value 0.001. If we assume the data are exponential and use the method in Part 4, the 95% CI is $(9.13, 22.17),$ but we have no way to know whether the data are exponential.

Accordingly, we find a 95% nonparametric bootstrap CI by approximating $L^*$ and $U^*$ such that $P(L^* < D = \bar X/\mu < U^*) \approx 0.95.$ In the R code below, the suffix .re indicates random 're-sampled' quantities based on $B$ samples of size $n$ randomly chosen with replacement from among the $n = 20$ observations. The resulting 95% CI is $(9.17, 22.71).$ [There are many styles of bootstrap CIs. This one treats $\mu$ as if it is a scale parameter. Other choices are possible.]

B = 10^5; a.obs = 13.54
d.re = replicate(B, mean(sample(x, 20, rep=T))/a.obs)
UL.re = quantile(d.re, c(.975,.025))
a.obs/UL.re
    97.5%      2.5%
 9.172171 22.714980
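
As an illustration of one other bootstrap style, the sketch below puts the ratio-based interval next to a simple percentile interval. The original 20 observations are not listed, so the data here are a hypothetical stand-in generated from an exponential distribution; the numbers will therefore not match those above.

```r
set.seed(7)
x <- rexp(20, rate = 1 / 13)  # hypothetical stand-in for the 20 observations
B <- 1e4
a.obs <- mean(x)

# Re-sampled means: samples of size n drawn WITH replacement from x
a.re <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

# Ratio style: treat mu as a scale parameter
ci_ratio <- a.obs / quantile(a.re / a.obs, c(0.975, 0.025))
# Percentile style: take quantiles of the re-sampled means directly
ci_pct <- quantile(a.re, c(0.025, 0.975))

rbind(ratio = unname(ci_ratio), percentile = unname(ci_pct))
```

For right-skewed data the ratio-style interval tends to reach further to the right than the percentile interval. Which style is preferable depends on what one is willing to assume about the population.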
