Solved – Confusion regarding weighted average

Tags: r, mean, weighted mean

This is a pretty simple question and I'm wondering how to go about finding a solution. I have the following data.

ttt = data.frame(name=c("A","B","C","D","E","F"),
                 count=c(150,250,350,550,150,50),
                 returns=c(10,50,60,100,80,10),
                 calls_bad=c(5,30,20,15,15,20),
                 weight=c(0.20,0.30,0.40,0.40,0.20,0.10))
ttt

For each name, I'm trying to find the return rate, which is just returns/count. However, I want to know how to weight that by how high or low count is. Name F, which only has a count of 50, should be weighted less (weight is 0.10) than name D, which has a count of 550 (weight is 0.40).

ttt$returns_rate = round(ttt$returns/ttt$count, 2)
ttt
  name count returns calls_bad weight returns_rate
1    A   150      10         5    0.2         0.07
2    B   250      50        30    0.3         0.20
3    C   350      60        20    0.4         0.17
4    D   550     100        15    0.4         0.18
5    E   150      80        15    0.2         0.53
6    F    50      10        20    0.1         0.20
library(plyr)
d1 = ddply(ttt, .(name), function(x) data.frame(score=weighted.mean(x$returns_rate, x$weight)))
d1
  name score
1    A  0.07
2    B  0.20
3    C  0.17
4    D  0.18
5    E  0.53
6    F  0.20

Here are the weights:

Count <= 100 = 10%
Count >= 101 & Count <= 200 = 20%
Count >= 201 & Count <= 300 = 30%
Count >= 301 = 40%
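
If it helps, here's how that weight column could be derived from count in R (a sketch; the breaks and bucket_weights vectors are just my encoding of the rule above):

# Bucket the counts into the four ranges above and map each bucket to its weight
breaks = c(-Inf, 100, 200, 300, Inf)
bucket_weights = c(0.10, 0.20, 0.30, 0.40)
ttt$weight = bucket_weights[cut(ttt$count, breaks, labels = FALSE)]
ttt$weight
# [1] 0.2 0.3 0.4 0.4 0.2 0.1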

Given what I'm trying to achieve, I thought I'd want a weighted average, but I end up with the results above. This can't be right, though, as returns_rate is equal to score.

How can I go about getting the rate while accounting for the size of count?

Thanks! I must've fallen asleep in basic math on this one.

EDIT:
In light of the comments:

Eventually, the plan is to combine the calls_bad rate (calls_bad/count) and the returns rate (returns/count) into one 'score'/'metric' for each name. However, because count varies significantly by name, I'm working under the assumption that I need to 'weight' the data in order to account for the impact of small counts.

Basically, I want to find (returns/count) and (calls_bad/count) and combine these values into one value, while accounting for the fact that someone with a count of 50 could influence the data in a bad way, which is why I'm thinking of using weights.

EDIT 2:

ttt = data.frame(name=c("A","B","C","D","E","F"),
                 count=c(150,250,350,550,150,50),
                 returns=c(10,50,60,100,80,10),
                 calls_bad=c(5,30,20,15,15,20),
                 weight=c(0.20,0.30,0.40,0.40,0.20,0.10))

ttt$returns_rate = round(ttt$returns/ttt$count, 2)
ttt$calls_bad_rate = round(ttt$calls_bad/ttt$count, 2)

ttt

There are only two numbers to "combine" or "average": 'returns rate' and 'calls bad rate'.

ttt$combined = round((ttt$returns_rate + ttt$calls_bad_rate) / 2, 2)

But given how count varies "significantly" by name, I thought weights based on the size of count were appropriate.

Best Answer

I was writing my own weighted average algorithm yesterday, and it applies to what you're looking to do. Your problem is what you're calling returns_rate and the logic behind it: grouping by name gives ddply one row per group, so the weighted mean of a single value is just that value back again. For these examples we have a new field called weighted_score, or you could call it weighted_returns if you wanted. The point is that it's weighted.

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A      150      10       0.2         2
2     B      250      50       0.3         15
3     C      350      60       0.4         24
4     D      550      100      0.4         40
5     E      150      80       0.2         16
6     F      50       10       0.1         1
-----------------------------------------------------
sums                           1.6         98
weighted_average                           61.3

Logic:

weighted_score = returns*weight
weighted_average = sum(weighted_scores)/sum(weights)
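
In R, a minimal sketch against your ttt data frame (for what it's worth, weighted.mean(ttt$returns, ttt$weight) computes the same thing):

# Weight each name's returns, then divide total weighted returns by total weight
ttt$weighted_score = ttt$returns * ttt$weight
weighted_average = sum(ttt$weighted_score) / sum(ttt$weight)
weighted_average
# [1] 61.25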

I see that you have some logic to determine the weight; however, your upper limit is 40%. The weights really should span 0% through 100%. You can also calculate the weight on the fly for any given data set: find the highest count and divide the count in each row by that highest number, which gives you the appropriate 0% to 100% weights.
Example:

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A     150      10       0.2727         2.7
2     B     250      50       0.4545         22.7
3     C     350      60       0.6364         38.2
4     D     550      100      1.0000         100
5     E     150      80       0.2727         21.8
6     F     50       10       0.0909         0.9
-----------------------------------------------------
sums                          2.7273         186.3636
weighted_average                             68.3

Logic:

weight = count/550
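
In R, a sketch of the same idea (max(ttt$count) is 550 here, so nothing is hard-coded):

# Divide each count by the largest count to get 0-1 weights on the fly
ttt$weight = ttt$count / max(ttt$count)
ttt$weighted_score = ttt$returns * ttt$weight
sum(ttt$weighted_score) / sum(ttt$weight)
# [1] 68.33333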

Another, simpler option that doesn't require as much processing to figure out the appropriate weight is to create a static variable to use as your ceiling. Any count lower than the ceiling is weighted with lower importance, and anything at or above the ceiling is weighted at 100%.

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A     150      10       0.4983         5.0
2     B     250      50       0.8306         41.5
3     C     350      60       1.0000         60.0
4     D     550      100      1.0000         100
5     E     150      80       0.4983         39.9
6     F     50       10       0.1661         1.7
-----------------------------------------------------
sums                          3.9934         248.0399
weighted_average                             62.1

Logic:

if(count < 301) weight = count/301
if(count >= 301) weight = 1
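
In R, a sketch of the ceiling approach (pmin() caps the weight at 1, which covers both ifs above, including the count == 301 boundary):

# Counts below the ceiling get proportional weight; anything at or above it gets 100%
ceiling_count = 301
ttt$weight = pmin(ttt$count / ceiling_count, 1)
ttt$weighted_score = ttt$returns * ttt$weight
sum(ttt$weighted_score) / sum(ttt$weight)
# [1] 62.11314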