I have information on the distributions of anthropometric dimensions (like shoulder span) for children of different ages. For each age and dimension, I have mean, standard deviation. (I also have eight quantiles, but I don't think I'll be able to get what I want from them.)
For each dimension, I would like to estimate particular quantiles of the length distribution. If I assume that each of the dimensions is normally distributed, I can do this with the means and standard deviations. Is there a pretty formula that I can use to get the value associated with a particular quantile of the distribution?
The reverse is quite easy: For a particular value, get the area to the right of the value for each of the normal distributions (ages). Sum the results and divide by the number of distributions.
Update: Here's the same question in graphical form. Assume that each of the colored distributions is normally distributed.
Also, I obviously can just try a bunch of different lengths and keep changing them until I get one that's close enough to the desired quantile for my precision. I'm wondering if there is a better way than this. And if this is the right approach, is there a name for it?
Best Answer
Unfortunately, the standard normal (from which all others can be determined, since the normal is a location-scale family) quantile function does not admit a closed form (i.e. a 'pretty formula'). The closest thing to a closed form is that the standard normal quantile function is the function, $w$, that satisfies the differential equation
$$ \frac{d^2 w}{d p^2} = w \left(\frac{d w}{d p}\right)^2 $$
and the initial conditions $w(1/2) = 0$ and $w'(1/2) = \sqrt{2 \pi}$. In most computing environments there is a function that numerically calculates the normal quantile function. In R, you would type
to get the $p$'th quantile of the $N(\mu, \sigma^2)$ distribution.
Edit: With a modified understanding of the problem, the data is generated from a mixture of normals, so that the density of the observed data is:
$$ p(x) = \sum_{i} w_{i} p_{i}(x) $$
where $\sum_{i} w_{i} = 1$ and each $p_{i}(x)$ is some normal density with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. It follows that the CDF of the observed data is
$$ F(y) = \int_{-\infty}^{y} \sum_{i} w_{i} p_{i}(x) dx = \sum_{i} w_{i} \int_{-\infty}^{y} p_{i}(x) = \sum_{i} w_{i} F_{i}(y) $$
where $F_{i}(x)$ is the normal CDF with mean $\mu_{i}$ and standard deviation $\sigma_{i}$. Integration and summation can be interchanged because these integrals are finite. This CDF is continuous and easy enough to calculate on a computer, so the inverse CDF, $F^{-1}$, also known as the quantile function, can be calculated by doing a line search. I default to this option because no simple formula for the quantile function of a mixture of normals, as a function of the quantiles of the constituent distributions, comes to mind.
The following R code numerically calculates $F^{-1}$ using bisection for the line search. The function F_inv() is the quantile function, you need to supply the vector containing each $w_{i}, \mu_{i}, \sigma_{i}$ and the quantile to be solved for, $p$.