[Math] Combining geometric means from different datasets

probability, statistics

In statistics, one sometimes uses the geometric mean, which for a dataset $\{x_i\}_{i=1}^N$ is defined as $$\left(\prod_{i=1}^N x_i\right)^{1/N}.$$
This is particularly useful when the data of the experiment is distributed across many orders of magnitude, so that it would make more sense to plot histograms on a log scale than a linear scale.

Now suppose I am doing an experiment to determine some experimental variable $X$, which theoretically is the geometric mean of the data set I measure. Suppose I have done this experiment repeatedly to generate $r$ data sets with geometric means $\mu_1, \mu_2, \ldots, \mu_r$ and geometric standard deviations $\sigma_1,\sigma_2, \ldots, \sigma_r$. How do I combine the data from these different experiments to obtain one geometric mean that is the best estimate for the variable?

Can one simply take the arithmetic mean and standard error of this collection of means? If so, why? Or should we consider a different 'geometric error of the mean'?

Best Answer

If $$\mu_k = \left(\prod_{i=1}^{N_k} x_i \right)^{1/N_k}$$ is the geometric mean of the $k^{\rm th}$ sample, with sample size $N_k$ (the product runs over the observations $x_i$ of that sample), then the overall geometric mean is $$\mu = \left(\prod_{k=1}^r \mu_k^{N_k} \right)^{1/N}, \quad N = \sum_{k=1}^r N_k.$$

The geometric standard deviation can be found by a similar inversion of the formula $$\sigma_k = \exp\left(\sqrt{\frac{1}{N_k}\sum_{i=1}^{N_k} \left(\log \frac{x_i}{\mu_k}\right)^2}\right):$$ we have $$\sum_{i=1}^{N_k} \left(\log \frac{x_i}{\mu_k}\right)^2 = N_k \left(\log \sigma_k \right)^2.$$ It is a short exercise to show that $$\sum_{i=1}^{N_k} \left(\log \frac{x_i}{\mu} \right)^2 = N_k \left( (\log \sigma_k)^2 + \left(\log \frac{\mu}{\mu_k}\right)^2\right):$$ write $\log \frac{x_i}{\mu} = \log \frac{x_i}{\mu_k} + \log \frac{\mu_k}{\mu}$ and expand the square; the cross term vanishes because $\sum_{i=1}^{N_k} \log \frac{x_i}{\mu_k} = 0$ by the definition of $\mu_k$. Hence $$N (\log \sigma)^2 = \sum_{k=1}^r \sum_{i=1}^{N_k} \left( \log \frac{x_i}{\mu} \right)^2 = \sum_{k=1}^r N_k \left( (\log \sigma_k)^2 + \left(\log \frac{\mu}{\mu_k}\right)^2\right),$$ from which we can recover the overall geometric standard deviation $\sigma$.

Note that these formulas use only the geometric means, geometric standard deviations, and sample sizes $$\{\mu_k\}_{k=1}^r, \quad \{\sigma_k\}_{k=1}^r, \quad \{N_k\}_{k=1}^r,$$ and do not require the original data.
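As a numerical sanity check, here is a minimal Python sketch of the pooling. It assumes the $1/N_k$ convention for the geometric standard deviation and the decomposition $\sum_i (\log(x_i/\mu))^2 = N_k\left((\log\sigma_k)^2 + (\log(\mu/\mu_k))^2\right)$, in which the between-sample term is added to the within-sample term. The function names `geometric_stats` and `combine_geometric` and the sample values are made up for illustration only.

```python
import math


def geometric_stats(xs):
    """Return (geometric mean, geometric SD, n), using the 1/N convention."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mean_log = sum(logs) / n
    var_log = sum((y - mean_log) ** 2 for y in logs) / n
    return math.exp(mean_log), math.exp(math.sqrt(var_log)), n


def combine_geometric(mus, sigmas, ns):
    """Pool per-sample geometric means and SDs without the raw data."""
    n_total = sum(ns)
    # log(mu) is the N_k-weighted average of the log(mu_k)
    log_mu = sum(n * math.log(mu) for mu, n in zip(mus, ns)) / n_total
    # within-sample spread plus the shift of each sample mean from mu
    sum_sq = sum(
        n * (math.log(sigma) ** 2 + (log_mu - math.log(mu)) ** 2)
        for mu, sigma, n in zip(mus, sigmas, ns)
    )
    return math.exp(log_mu), math.exp(math.sqrt(sum_sq / n_total)), n_total


# Hypothetical positive-valued samples spanning several orders of magnitude.
samples = [[1.0, 10.0, 100.0], [2.0, 8.0], [0.5, 5.0, 50.0, 500.0]]

mus, sigmas, ns = zip(*(geometric_stats(s) for s in samples))
combined = combine_geometric(mus, sigmas, ns)

# Reference values computed directly from all the raw data at once.
pooled = geometric_stats([x for s in samples for x in s])
```

Here `combined` and `pooled` should agree up to floating-point rounding. Working in logs throughout also avoids overflow in the product $\prod_k \mu_k^{N_k}$ when the $N_k$ are large.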
