Solved – Quantiles from the combination of normal distributions

Tags: aggregation, gaussian mixture distribution, normal distribution, quantiles

I have information on the distributions of anthropometric dimensions (like shoulder span) for children of different ages. For each age and dimension, I have the mean and the standard deviation. (I also have eight quantiles, but I don't think I'll be able to get what I want from them.)

For each dimension, I would like to estimate particular quantiles of the length distribution. If I assume that each of the dimensions is normally distributed, I can do this with the means and standard deviations. Is there a pretty formula that I can use to get the value associated with a particular quantile of the distribution?

The reverse is quite easy: For a particular value, get the area to the right of the value for each of the normal distributions (ages). Sum the results and divide by the number of distributions.

Update: Here's the same question in graphical form. Assume that each of the colored distributions is normally distributed.

Also, I obviously can just try a bunch of different lengths and keep changing them until I get one that's close enough to the desired quantile for my precision. I'm wondering if there is a better way than this. And if this is the right approach, is there a name for it?

Best Answer

Unfortunately, the standard normal (from which all others can be determined, since the normal is a location-scale family) quantile function does not admit a closed form (i.e. a 'pretty formula'). The closest thing to a closed form is that the standard normal quantile function is the function, $w$, that satisfies the differential equation

$$ \frac{d^2 w}{d p^2} = w \left(\frac{d w}{d p}\right)^2 $$

and the initial conditions $w(1/2) = 0$ and $w'(1/2) = \sqrt{2 \pi}$. In most computing environments there is a function that numerically calculates the normal quantile function. In R, you would type

qnorm(p, mean=mu, sd=sigma)

to get the $p$'th quantile of the $N(\mu, \sigma^2)$ distribution.
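Because the normal is a location-scale family, the same quantile can also be obtained by shifting and scaling a standard normal quantile. A minimal sketch of that identity; the values of `mu`, `sigma`, and `p` are illustrative assumptions, not data from the question:

```r
# Illustrative values (assumptions, not data from the question)
mu    <- 30
sigma <- 2.5
p     <- c(0.05, 0.5, 0.95)

# Quantiles computed directly from N(mu, sigma^2) ...
q_direct <- qnorm(p, mean = mu, sd = sigma)

# ... and by shifting and scaling the standard normal quantiles
q_scaled <- mu + sigma * qnorm(p)

# The two agree up to floating-point error
all.equal(q_direct, q_scaled)
```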

Edit: With a modified understanding of the problem, the data is generated from a mixture of normals, so that the density of the observed data is:

$$ p(x) = \sum_{i} w_{i} p_{i}(x) $$

where $\sum_{i} w_{i} = 1$ and each $p_{i}(x)$ is some normal density with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. It follows that the CDF of the observed data is

$$ F(y) = \int_{-\infty}^{y} \sum_{i} w_{i} p_{i}(x) \, dx = \sum_{i} w_{i} \int_{-\infty}^{y} p_{i}(x) \, dx = \sum_{i} w_{i} F_{i}(y) $$

where $F_{i}(x)$ is the normal CDF with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. Integration and summation can be interchanged because the sum has finitely many terms and each integral is finite. This CDF is continuous and easy enough to calculate on a computer, so the inverse CDF, $F^{-1}$, also known as the quantile function, can be calculated by doing a line search. I default to this option because no simple formula for the quantile function of a mixture of normals, as a function of the quantiles of the constituent distributions, comes to mind.

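The following R code numerically calculates $F^{-1}$ using uniroot() for the line search (R documents uniroot() as based on Brent's root-finding algorithm, which combines bisection with interpolation steps, rather than plain bisection). The function F_inv() is the quantile function; you need to supply the vectors containing the $w_{i}$, $\mu_{i}$, and $\sigma_{i}$, together with the quantile to be solved for, $p$.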

# evaluate the function at the point x, where the components 
# of the mixture have weights w, means stored in u, and std deviations
# stored in s - all must have the same length.
F = function(x,w,u,s) sum( w*pnorm(x,mean=u,sd=s) )

# provide an initial bracket for the quantile. default is c(-1000,1000). 
F_inv = function(p,w,u,s,br=c(-1000,1000))
{
   G = function(x) F(x,w,u,s) - p
   return( uniroot(G,br)$root ) 
}

#test 
# data is 50% N(0,1), 25% N(2,1), 20% N(5,1), 5% N(10,1)
X = c(rnorm(5000), rnorm(2500,mean=2,sd=1),rnorm(2000,mean=5,sd=1),rnorm(500,mean=10,sd=1))
quantile(X,.95)
    95% 
7.69205 
F_inv(.95,c(.5,.25,.2,.05),c(0,2,5,10),c(1,1,1,1))
[1] 7.745526

# data is 20% N(-5,1), 45% N(5,1), 30% N(10,1), 5% N(15,1)
# (sample sizes match those weights; the sample quantile varies from run to run)
X = c(rnorm(2000,mean=-5,sd=1), rnorm(4500,mean=5,sd=1),
      rnorm(3000,mean=10,sd=1), rnorm(500, mean=15,sd=1))
quantile(X,.95)
     95% 
12.69563 
F_inv(.95,c(.2,.45,.3,.05),c(-5,5,10,15),c(1,1,1,1))
[1] 12.81730
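For comparison, the same root can be found by plain bisection, and the fixed default bracket can be avoided. The sketch below rests on one observation: for positive weights summing to 1, the mixture's $p$-quantile lies between the smallest and the largest of the component $p$-quantiles, because the mixture CDF is at most $p$ at the former and at least $p$ at the latter. The names `mix_cdf` and `mix_quantile_bisect` and the tolerance `tol` are hypothetical choices for this illustration, not part of the answer's code.

```r
# CDF of a normal mixture with weights w, means u and std deviations s
# (same role as F above, redefined so that this block is self-contained)
mix_cdf <- function(x, w, u, s) sum(w * pnorm(x, mean = u, sd = s))

# Mixture quantile by plain bisection.
# Assumes w > 0 and sum(w) == 1, so that the component quantiles
# bracket the mixture quantile.
mix_quantile_bisect <- function(p, w, u, s, tol = 1e-8) {
  q_comp <- qnorm(p, mean = u, sd = s)  # p-quantile of each component
  lo <- min(q_comp)                     # mix_cdf(lo) <= p
  hi <- max(q_comp)                     # mix_cdf(hi) >= p
  while (hi - lo > tol) {
    mid <- (lo + hi) / 2
    # keep the half of the bracket that still contains the root
    if (mix_cdf(mid, w, u, s) < p) lo <- mid else hi <- mid
  }
  (lo + hi) / 2
}

# first test case from above: 50% N(0,1), 25% N(2,1), 20% N(5,1), 5% N(10,1)
q95 <- mix_quantile_bisect(.95, c(.5, .25, .2, .05), c(0, 2, 5, 10), c(1, 1, 1, 1))
q95
```

The result should agree with the uniroot() value for the same mixture to within the two tolerances. Bisection halves the bracket at every step, so it typically needs more CDF evaluations than uniroot() for the same accuracy, but it is easy to reason about.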

Related Solutions

Solved – Compute quantile of sum of distributions from particular quantiles

$q_Z$ could be anything.

To understand this situation, let us make a preliminary simplification. By working with $Y_i = X_i - q_i$ we obtain a more uniform characterization

$$\alpha = \Pr(X_i \le q_i) = \Pr(Y_i \le 0).$$

That is, each $Y_i$ has the same probability of being negative. Because

$$W = \sum_i Y_i = \sum_i X_i - \sum_i q_i = Z - \sum_i q_i,$$

the defining equation for $q_Z$ is equivalent to

$$\alpha = \Pr(Z \le q_Z) = \Pr(Z - \sum_i q_i \le q_Z - \sum_i q_i) = \Pr(W \le q_W)$$

with $q_Z = q_W + \sum_i q_i$.

What are the possible values of $q_W$? Consider the case where the $Y_i$ all have the same distribution with all probability on two values, one of them negative ($y_{-}$) and the other one positive ($y_{+}$). The possible values of the sum $W$ are limited to $ky_{-} + (n-k)y_{+}$ for $k=0, 1, \ldots, n$. Each of these occurs with probability

$${\Pr}_W(ky_{-} + (n-k)y_{+}) = \binom{n}{k}\alpha^k(1-\alpha)^{n-k}.$$

The extremes can be found by:

1. Choosing $y_{-}$ and $y_{+}$ so that $y_{-} + (n-1)y_{+} \lt 0$; $y_{-}=-n$ and $y_{+}=1$ will accomplish this. This guarantees that $W$ will be negative except when all the $Y_i$ are positive. This chance equals $1 - (1-\alpha)^n$. It exceeds $\alpha$ when $n\gt 1$, implying the $\alpha$ quantile of $W$ must be strictly negative.

2. Choosing $y_{-}$ and $y_{+}$ so that $(n-1) y_{-} + y_{+} \gt 0$; $y_{-}=-1$ and $y_{+}=n$ will accomplish this. This guarantees that $W$ will be negative only when all the $Y_i$ are negative. This chance equals $\alpha^n$. It is less than $\alpha$ when $n\gt 1$, implying the $\alpha$ quantile of $W$ must be strictly positive.

This shows that the $\alpha$ quantile of $W$ could be either negative or positive, but is not zero. What could its size be? It has to equal some integral linear combination of $y_{-}$ and $y_{+}$. Making both these values integers assures all the possible values of $W$ are integral. Upon scaling $y_{\pm}$ by an arbitrary positive number $s$, we can guarantee that all integral linear combinations of $y_{-}$ and $y_{+}$ are integral multiples of $s$. Since $q_W \ne 0$, it must be at least $s$ in size. Consequently, the possible values of $q_W$ (and whence of $q_Z$) are unlimited, no matter what $n\gt 1$ may equal.

The only way to derive any information about $q_Z$ would be to make specific and strong constraints on the distributions of the $X_i$, in order to prevent and limit the kind of unbalanced distributions used to derive this negative result.
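The two-point construction can be checked numerically. The following is only a sketch: the helper name `two_point_quantile` and the values $n=3$, $\alpha=0.3$, and $s=100$ are assumptions picked for illustration. It takes the $\alpha$ quantile to be the smallest value of $W$ whose cumulative probability reaches $\alpha$.

```r
# alpha-quantile of W = Y_1 + ... + Y_n, where the Y_i are i.i.d. and take
# the value y_minus with probability alpha and y_plus with probability 1 - alpha
two_point_quantile <- function(n, alpha, y_minus, y_plus) {
  k     <- 0:n                                 # number of negative terms
  w_val <- k * y_minus + (n - k) * y_plus      # possible values of W
  prob  <- dbinom(k, size = n, prob = alpha)   # binomial probabilities
  ord   <- order(w_val)                        # sort the support of W
  cdf   <- cumsum(prob[ord])
  # smallest value of W whose cumulative probability reaches alpha
  w_val[ord][which(cdf >= alpha)[1]]
}

n <- 3
alpha <- 0.3

q_neg <- two_point_quantile(n, alpha, y_minus = -n, y_plus = 1)  # -1
q_pos <- two_point_quantile(n, alpha, y_minus = -1, y_plus = n)  #  5

# scaling both support points by s scales the quantile by s
s <- 100
q_scaled <- two_point_quantile(n, alpha, y_minus = -n * s, y_plus = s)  # -100

c(q_neg, q_pos, q_scaled)
```

Every $Y_i$ has $\Pr(Y_i \le 0) = \alpha$ in all three calls, yet the $\alpha$ quantile of the sum changes sign and magnitude with the choice of support points.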

Solved – Estimating quantiles by simulation

If you first estimate $\theta$ by $\hat\theta$, a direct estimate of the $\alpha$-quantile is $F_{\hat\theta}^{-1}(\alpha)$. This is a convergent and biased estimator, whose asymptotic variance can be derived by the delta-method.

Your first solution is validated by the Glivenko–Cantelli theorem, namely the fact that the empirical cdf converges to the true cdf:

$$\hat{F}_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbb{I}_{X_i\le x} \stackrel{n\to\infty}{\longrightarrow}F_{\hat\theta}(x)$$

Once again, $\hat{F}_n^{-1}(\alpha)$ is a convergent and biased estimator, whose variance can be estimated by bootstrap.
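A bootstrap estimate of that variability could be sketched as follows. The sample `x`, the seed, and the number of bootstrap replicates `B` are assumptions chosen for illustration; a standard normal sample stands in for draws from $F_{\hat\theta}$.

```r
set.seed(1)          # arbitrary seed, used only to make the sketch reproducible
alpha <- 0.80
x <- rnorm(1000)     # stand-in for the simulated sample

# type = 1 makes quantile() return the inverse of the empirical cdf
q_hat <- quantile(x, probs = alpha, names = FALSE, type = 1)

# nonparametric bootstrap: resample with replacement, recompute the quantile
B <- 500
q_boot <- replicate(
  B,
  quantile(sample(x, replace = TRUE), probs = alpha, names = FALSE, type = 1)
)
se_boot <- sd(q_boot)   # bootstrap standard error of the quantile estimate

c(estimate = q_hat, bootstrap_se = se_boot)
```

For this stand-in sample the estimate can be compared with the exact value `qnorm(0.80)`.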

Your second method uses an average of estimators validated by your first method, hence it is equally valid and equally biased. However, for a given computing budget, i.e., a pre-determined total number of simulations, you have to run an experiment to compare both methods.

For instance, running a toy experiment aiming at estimating the normal 80% quantile (equal to 0.8416212) shows the difference between both your approaches.

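#method 1: N = 10^4 quantile estimates, one per row, each from R = 10^3 draws
R=10^3
N=10^4
x=matrix(rnorm(R*N),ncol=R)          # N rows, R columns
kant=apply(x,1,quantile,probs=.80)   # 80% quantile of each row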

leads to

> sd(kant)
[1] 0.04513716
> mean(kant)
[1] 0.8404416

while

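#method 2: R = 10^3 estimates, each the mean of N = 10^2 quantiles from M = 10^2 draws
R=10^3
M=N=10^2
kant=rep(0,R)
for (r in 1:R){
  x=matrix(rnorm(M*N),ncol=M)                  # N rows, M columns
  kant[r]=mean(apply(x,1,quantile,probs=.80))  # average of the N row quantiles
  }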

leads to

> sd(kant)
[1] 0.01375016
> mean(kant)
[1] 0.8285708

hence to a smaller variance but to a larger bias. (This experiment does not account for the variability in replacing $\theta$ by $\hat\theta$.)
