Solved – Calculating mean and variance with logarithmic sample weights

data transformation, sampling, weighted-sampling

I have run into a problem that must be pretty simple, but I keep getting snagged somewhere. I have an algorithm that returns a sample and the logarithm of the sample weight (which can get pretty large, in the range 1e2 – 1e4 for typical applications). To take the sample-weighted mean, one can use the formula:

$\langle Q \rangle = \frac{ \sum_i w_i Q_i }{ \sum_i w_i }$

Taking the log of both sides:

$\ln \langle Q \rangle = \ln \sum_i w_i Q_i - \ln \sum_i w_i$

and applying the identity to both RHS terms (see Wikipedia):

$\ln (a+b) = \ln a + \ln \big[1+\exp(\ln b - \ln a)\big]$

in the general form for a sequence of numbers:

$\ln \sum_{i \ge 0} a_i = \ln a_0 + \ln \bigg[1 + \sum_{i \ge 1} \exp( \ln a_i - \ln a_0 )\bigg]$

would allow one to calculate the mean using only the logarithm of the sample weights, and not the sample weight directly (and similarly for the variance). The problem here, however, is obviously numerical stability. After some research, I have come across two parallel suggestions:

(1) For the case of two points, $a$ and $b$, it's easy to order the terms such that $a \ge b$ and apply methods for calculating $\ln(1+\epsilon)$, with $\epsilon = b/a$ between 0 and 1. I'm failing to see, however, how this simplification generalizes to the case of more than two points, where the value of the sum (in the general formulation) will probably be above 1.

(2) Using an algorithm for calculating the weighted incremental variance (as here). I'm also failing to see here how to apply this algorithm to the case of logarithmic sample weights, as the mathematical formula on which to apply the logarithm (and identities) is not clear.

My apologies for the long question, but after some time of going through the research on tricks for making this calculation more stable, I am still getting quite confused, as the tricks either don't seem to apply generally, or lead to contradictions.

Any thoughts on this problem?

Best Answer

Update 2014-04-04: Create the reasonably-sized weights from the logarithms. See below:

I don't see that a logarithmic approach is needed. To deal with very large weights, divide each by a large constant, e.g. $10^3$ or $10^4$, which will be a simple matter of moving the decimal point. Then apply the standard formulas for weighted means; these are invariant to changes in the weights of the form $w'=C\thinspace w$, because the constant $C$ cancels out in numerator and denominator. Similar remarks apply to weighted estimates of variance.
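To spell out the cancellation, write $w_i' = C\thinspace w_i$ for any constant $C > 0$ (dividing by a constant is the same thing with $C$ replaced by $1/C$):

$$ \frac{\sum_i w_i' Q_i}{\sum_i w_i'} = \frac{C \sum_i w_i Q_i}{C \sum_i w_i} = \frac{\sum_i w_i Q_i}{\sum_i w_i} = \langle Q \rangle $$

The same factor cancels in a weighted variance of the form $\sum_i w_i (Q_i - \langle Q \rangle)^2 / \sum_i w_i$, assuming that is the variance estimate in use.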

Update: Get revised weight $w'$ from logs

If $C$ is a constant such as $10^3$ or $10^4$, and $\log(w)$ is the log weight, then

$$ w' = \frac{w}{C} = \exp(\log(w)-\log(C)) $$

which for $C = 10^4$ would be

$$ w' = \exp(\log(w)-4\log(10)) $$
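As a rough sketch of how this rescaling might look in code, the Python below computes a weighted mean and a (biased) weighted variance from log weights only. Two things here are assumptions and not part of the answer above: the function name `weighted_mean_var` is made up for illustration, and $\log(C)$ is taken to be the largest log weight instead of a fixed $10^3$ or $10^4$, so that every rescaled weight lies in $(0, 1]$ and `exp` cannot overflow. The sample values and log weights near 5000 are also purely illustrative.

```python
import math

def weighted_mean_var(values, log_weights):
    """Weighted mean and biased weighted variance from log weights."""
    # Assumption: use the largest log weight as log(C), so every
    # rescaled weight w' = exp(log w - log C) lies in (0, 1].
    log_c = max(log_weights)
    w = [math.exp(lw - log_c) for lw in log_weights]
    total = sum(w)
    # The constant C cancels in both ratios below.
    mean = sum(wi * q for wi, q in zip(w, values)) / total
    var = sum(wi * (q - mean) ** 2 for wi, q in zip(w, values)) / total
    return mean, var

# Illustrative data: weights in the ratio 1 : 2 : 1, but far too
# large to exponentiate directly (exp(5000) overflows a double).
values = [1.0, 2.0, 3.0]
log_weights = [5000.0, 5000.0 + math.log(2.0), 5000.0]
mean, var = weighted_mean_var(values, log_weights)
```

Weights that are many orders of magnitude smaller than the largest one underflow to zero under this choice, which is usually harmless because they would not have contributed to the sums at double precision anyway.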

Note that the sum in your last expression

$$ \sum_i \exp\left(\ln a_i - \ln a_0 \right) $$

is equivalent to writing

$$ \sum_i \left(\frac{a_i}{a_0}\right) $$

This is just a standardization of each $a_i, i\gt 0$, by the first term. Thus your proposal cannot escape summing the weights.
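The sum of standardized terms can still be evaluated safely if the reference term is chosen well. The sketch below is one possible way to do it, not something taken from the question: it assumes the largest weight is used as the reference in place of $a_0$, so each ratio lies in $(0, 1]$ and the sum lies between 1 and the number of terms. The function name `log_sum_weights` is hypothetical.

```python
import math

def log_sum_weights(log_weights):
    """Return ln(sum_i w_i) given only the values ln(w_i)."""
    # Assumption: standardize by the largest weight, not by a_0.
    ref = max(log_weights)
    # Each ratio w_i / w_ref is in (0, 1]; the reference term itself
    # contributes exactly 1, so the sum is at least 1.
    ratio_sum = sum(math.exp(lw - ref) for lw in log_weights)
    return ref + math.log(ratio_sum)

# Three equal weights of exp(1000) each: the total is 3 * exp(1000).
log_total = log_sum_weights([1000.0, 1000.0, 1000.0])
```

The weights are still summed, as noted above, but only after standardization. Because the sum of ratios is bounded by the number of terms, an ordinary `log` is adequate here and no special $\ln(1+\epsilon)$ routine is required for more than two points.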
