Solved – Quantiles from the combination of normal distributions

aggregationgaussian mixture distributionnormal distributionquantiles

I have information on the distributions of anthropometric dimensions (like shoulder span) for children of different ages. For each age and dimension, I have mean, standard deviation. (I also have eight quantiles, but I don't think I'll be able to get what I want from them.)

For each dimension, I would like to estimate particular quantiles of the length distribution. If I assume that each of the dimensions is normally distributed, I can do this with the means and standard deviations. Is there a pretty formula that I can use to get the value associated with a particular quantile of the distribution?

The reverse is quite easy: For a particular value, get the area to the right of the value for each of the normal distributions (ages). Sum the results and divide by the number of distributions.

Update: Here's the same question in graphical form. Assume that each of the colored distributions is normally distributed.
The same question in graphical form

Also, I obviously can just try a bunch of different lengths and keep changing them until I get one that's close enough to the desired quantile for my precision. I'm wondering if there is a better way than this. And if this is the right approach, is there a name for it?

Best Answer

Unfortunately, the standard normal (from which all others can be determined, since the normal is a location-scale family) quantile function does not admit a closed form (i.e. a 'pretty formula'). The closest thing to a closed form is that the standard normal quantile function is the function, $w$, that satisfies the differential equation

$$ \frac{d^2 w}{d p^2} = w \left(\frac{d w}{d p}\right)^2 $$

and the initial conditions $w(1/2) = 0$ and $w'(1/2) = \sqrt{2 \pi}$. In most computing environments there is a function that numerically calculates the normal quantile function. In R, you would type

qnorm(p, mean=mu, sd=sigma)

to get the $p$'th quantile of the $N(\mu, \sigma^2)$ distribution.


Edit: With a modified understanding of the problem, the data is generated from a mixture of normals, so that the density of the observed data is:

$$ p(x) = \sum_{i} w_{i} p_{i}(x) $$

where $\sum_{i} w_{i} = 1$ and each $p_{i}(x)$ is some normal density with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. It follows that the CDF of the observed data is

$$ F(y) = \int_{-\infty}^{y} \sum_{i} w_{i} p_{i}(x) dx = \sum_{i} w_{i} \int_{-\infty}^{y} p_{i}(x) = \sum_{i} w_{i} F_{i}(y) $$

where $F_{i}(x)$ is the normal CDF with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. Integration and summation can be interchanged because these integrals are finite. This CDF is continuous and easy enough to calculate on a computer, so the inverse CDF, $F^{-1}$, also known as the quantile function, can be calculated by doing a line search. I default to this option because no simple formula for the quantile function of a mixture of normals, as a function of the quantiles of the constituent distributions, comes to mind.

The following R code numerically calculates $F^{-1}$ using bisection for the line search. The function F_inv() is the quantile function, you need to supply the vector containing each $w_{i}, \mu_{i}, \sigma_{i}$ and the quantile to be solved for, $p$.

# evaluate the function at the point x, where the components 
# of the mixture have weights w, means stored in u, and std deviations
# stored in s - all must have the same length.
F = function(x,w,u,s) sum( w*pnorm(x,mean=u,sd=s) )

# provide an initial bracket for the quantile. default is c(-1000,1000). 
F_inv = function(p,w,u,s,br=c(-1000,1000))
{
   G = function(x) F(x,w,u,s) - p
   return( uniroot(G,br)$root ) 
}

#test 
# data is 50% N(0,1), 25% N(2,1), 20% N(5,1), 5% N(10,1)
X = c(rnorm(5000), rnorm(2500,mean=2,sd=1),rnorm(2000,mean=5,sd=1),rnorm(500,mean=10,sd=1))
quantile(X,.95)
    95% 
7.69205 
F_inv(.95,c(.5,.25,.2,.05),c(0,2,5,10),c(1,1,1,1))
[1] 7.745526

# data is 20% N(-5,1), 45% N(5,1), 30% N(10,1), 5% N(15,1)
X = c(rnorm(5000,mean=-5,sd=1), rnorm(2500,mean=5,sd=1),
      rnorm(2000,mean=10,sd=1), rnorm(500, mean=15,sd=1))
quantile(X,.95)
     95% 
12.69563 
F_inv(.95,c(.2,.45,.3,.05),c(-5,5,10,15),c(1,1,1,1))
[1] 12.81730