Formula for weighted average difference

functions, logarithms, means, statistics

Intro

In my application I have instances where $m_1\ge1$ and $m_2\ge1$ models have produced (after some other calculations) activity values $a_1,a_2\in[0,1]$ respectively ($0$ means inactive, $1$ means fully activated). The $m_1$ models are the good ones and the $m_2$ models the bad ones, so to speak. I always take the difference: $$d=a_1-a_2 \in [-1,1]$$ where values closer to $-1$ indicate that the good models are in general less active than the bad ones in that grouping ($m_1$ vs $m_2$ models), values closer to $1$ indicate that they are more active, and values closer to $0$ indicate that there is no activation difference between the two groups of models.

Weighted mean version

That works OK, but I wanted to take the numbers of models $m_1,m_2$ into account to counter the bias this introduces, as the next example shows: $a_1=0.9,a_2=0.1,d=0.8$, with $m_1=5,m_2=1000$. I would like a difference smaller than $0.8$, since $m_1 \ll m_2$. So I took a (kind of) weighted mean: $$d_{w}=\frac{m_1a_1-m_2a_2}{m_1+m_2}$$
The problem is that the penalty is now too large: for the previous example, $d_w=-0.095$, which is wrong, since I would never expect it to go below zero in this case.

Logarithmic weights version

So, I want to reduce the penalty by bringing the smaller and larger numbers closer together, and what better way to do exactly that than $\log_{10}$: $$d_{lw}=\frac{\log(m_1)a_1-\log(m_2)a_2}{\log(m_1)+\log(m_2)}$$

Now, the example above produces $d_{lw}=0.0889$, much more sensible! The good models are more active, but since I got just $5$ of them vs $1000$ bad ones, the activity difference estimation is penalized.
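
The three variants so far can be checked numerically. The following is a minimal Python sketch, assuming base-10 logarithms as described above; the function names are illustrative only and not part of any library.

```python
import math

def plain_difference(a1, a2):
    # d = a1 - a2, ignores the model counts entirely
    return a1 - a2

def weighted_difference(m1, m2, a1, a2):
    # d_w: activities weighted by the raw model counts
    return (m1 * a1 - m2 * a2) / (m1 + m2)

def log_weighted_difference(m1, m2, a1, a2):
    # d_lw: activities weighted by log10 of the model counts.
    # Undefined (division by zero) when m1 == m2 == 1.
    w1, w2 = math.log10(m1), math.log10(m2)
    return (w1 * a1 - w2 * a2) / (w1 + w2)

# The example from the text: few good models, many bad ones.
m1, m2, a1, a2 = 5, 1000, 0.9, 0.1
d = plain_difference(a1, a2)
d_w = weighted_difference(m1, m2, a1, a2)
d_lw = log_weighted_difference(m1, m2, a1, a2)
print(d, d_w, d_lw)
```

Because the logarithms appear in both numerator and denominator, the base cancels, so natural logs would give the same $d_{lw}$.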

Problems

For $m_1=1$ or $m_1=2$ models the $d_{lw}$ result is negative in the example! And if we put $m_1=m_2$, I would expect the result to equal the original difference, $d_{lw}=d$, but it does not 🙁 I tried doubling the last difference, $d_2=2d_{lw}$, which solved this, but now of course $d_2\in[-2,2]$, which I don't want.
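
Both problems follow directly from the $d_{lw}$ formula. Setting $m_1=m_2=m$ (with $m>1$) gives $$d_{lw}=\frac{\log(m)a_1-\log(m)a_2}{2\log(m)}=\frac{a_1-a_2}{2}=\frac{d}{2},$$ which is why doubling restores $d$ in the equal-count case. Setting $m_1=1$ gives $\log(m_1)=0$, so $$d_{lw}=\frac{-\log(m_2)a_2}{\log(m_2)}=-a_2,$$ i.e. $-0.1$ in the example, regardless of $a_1$.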

The Ultimate Formula!

I am looking for a difference function $f(m_1,m_2,a_1,a_2)\in[-1,1]$ for which the following properties hold:

  • $f(m,m,a_1,a_2)=a_1-a_2=d$
  • $\lim_{m_1 \ll m_2}f(m_1,m_2,a_1,a_2)=0$ (like in the example above). Same if $m_1 \gg m_2$.
  • The transition from the $m_1=m_2$ case to the extreme ones (where the model numbers differ too much) must not be steep. I don't know how to phrase this in one word, but what I mean is that as the model numbers start to move away from the equality $m_1=m_2$, the difference $d$ should change only slowly, and only close to the extremes should we start to see a notable difference. I have also called this last property the mountain (it would be interesting to see what it corresponds to in mathematical terms), since the equality is like the top of the mountain and to its right and left are the slopes, which in my case I want to be walkable, i.e. not steep.

Best Answer

How about using, for some strictly increasing function $g(x)$,

$$f(m_1,m_2,a_1,a_2)=\left(\frac{g(\min(m_1, m_2))}{g(\max(m_1, m_2))}\right)(a_1 - a_2) \tag{1}\label{eq1A}$$

The simplest case for $g(x)$ would be $g(x) = x$, but you can also use something like your idea of a logarithm, e.g., $g(x) = \log(x)$, assuming the minimum of $m_1$ and $m_2$ is more than $1$. Also, to make it more flexible, you can add some constant $c \ge 0$ to the value of $x$ and/or use some power $y \gt 0$ (e.g., $g(x) = (x + c)^y$ or $g(x) = (\log(x + c))^y$, with $c \gt 0$ in this latter case allowing $m_1$ or $m_2$ to be $1$). You can try several things to see what works best for you. For simplicity, I will use $c = 0$ and $g(x) = x^y$, with $y = 1$ unless stated otherwise, for the rest of this answer.

Note \eqref{eq1A} satisfies your first requested property, i.e., $f(m, m, a_1, a_2) = a_1 - a_2 = d$. Also, with $y = 1$, your example of $a_1 = 0.9, a_2 = 0.1, d = 0.8$, with $m_1 = 5, m_2 = 1000$, would give $f(5,1000,0.9,0.1) = \frac{5}{1000}(0.9 - 0.1) = 0.004$. This is considerably less than the $0.0889$ from your $d_{lw}$ model, but possibly still reasonable considering how much larger $1000$ is than $5$. However, if you would like to reduce the effect of the difference, so the end result would be closer to $0.8$, you can use a smaller value of $y$. For example, with $y = 0.1$, you would get $f(5,1000,0.9,0.1) = (0.5887...)(0.8) \approx 0.47$.
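
The numbers above can be reproduced with a short Python sketch of \eqref{eq1A}, assuming the power form $g(x) = (x + c)^y$. The names `g_power` and `scaled_difference` are illustrative only.

```python
def g_power(x, c=0.0, y=1.0):
    # g(x) = (x + c)^y, strictly increasing for x + c > 0 and y > 0
    return (x + c) ** y

def scaled_difference(m1, m2, a1, a2, g=g_power):
    # Equation (1): shrink a1 - a2 by the ratio g(min) / g(max), which is
    # 1 when m1 == m2 and gets smaller as the counts drift apart.
    ratio = g(min(m1, m2)) / g(max(m1, m2))
    return ratio * (a1 - a2)

# y = 1 (the default) and y = 0.1 on the example from the question
f_y1 = scaled_difference(5, 1000, 0.9, 0.1)
f_y01 = scaled_difference(5, 1000, 0.9, 0.1, g=lambda x: g_power(x, y=0.1))
print(f_y1, f_y01)
```

Passing `g` as an argument makes it easy to try other strictly increasing choices, such as a shifted logarithm, without touching the rest of the function.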

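Also, provided $g(x) \to \infty$ as $x \to \infty$ (true for the power and logarithm choices above), this would satisfy your second condition, i.e.,

$\lim_{m_1 \ll m_2}f(m_1,m_2,a_1,a_2)=0$ (like in the example above). Same if $m_1 \gg m_2$.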

I believe this would also fit your third condition fairly well, i.e., it would be not-steep when transitioning from the $m_1 = m_2$ case to the extreme ones. The degree of "not steepness" would depend on what $g(x)$ function, including the value of $c$ and the $y$ power you choose, but I believe it should be quite reasonable for any appropriate choices.
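
One way to judge the steepness for a given choice is to tabulate the scaling factor as $m_1$ moves away from $m_2$. Below is a rough Python sketch, assuming $c = 0$ and $g(x) = x^y$; `ratio_factor` is a hypothetical helper name, and the counts are arbitrary sample values.

```python
def ratio_factor(m1, m2, y=1.0):
    # g(min) / g(max) with g(x) = x^y, written as (min / max)^y
    lo, hi = min(m1, m2), max(m1, m2)
    return (lo / hi) ** y

m2 = 1000
for m1 in (1000, 500, 100, 10, 1):
    # Compare the factor for y = 1 against the flatter y = 0.1
    print(m1, round(ratio_factor(m1, m2, 1.0), 4), round(ratio_factor(m1, m2, 0.1), 4))
```

With $y = 1$ the factor is already $0.5$ when one group is half the size of the other, while smaller values of $y$ keep it closer to $1$ for longer, so $y$ acts as the knob for how walkable the slope is.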

Finally, note you could replace $\frac{g(\min(m_1, m_2))}{g(\max(m_1, m_2))}$ with an even more general $2$ variable function $h(\min(m_1, m_2), \max(m_1, m_2))$, where it behaves in a similar fashion, e.g., $h(x,x) = 1$ and $h(x,y) \to 0$ as $\frac{y}{x} \to \infty$. However, I don't know if that extra complexity will give you much of an advantage compared to the simpler case I provided here, which is why I only mention it at the end.
