Solved – Confusion regarding weighted average

Tags: r, mean, weighted mean

This is a pretty simple question and I'm wondering how to go about finding a solution. I have the following data.

ttt = data.frame(name=c("A","B","C","D","E","F"),
                 count=c(150,250,350,550,150,50),
                 returns=c(10,50,60,100,80,10),
                 calls_bad=c(5,30,20,15,15,20),
                 weight=c(0.20,0.30,0.40,0.40,0.20,0.10))
ttt

For each name, I'm trying to find the return rate, which is just returns/count. However, I want to know how to weight that by how high or low count is. Name F, which only has a count of 50, should be weighted less (weight is 0.10) than name D, which has a count of 550 (weight is 0.40).

ttt$returns_rate = round(ttt$returns/ttt$count, 2)
ttt
  name count returns calls_bad weight returns_rate
1    A   150      10         5    0.2         0.07
2    B   250      50        30    0.3         0.20
3    C   350      60        20    0.4         0.17
4    D   550     100        15    0.4         0.18
5    E   150      80        15    0.2         0.53
6    F    50      10        20    0.1         0.20
library(plyr)
d1 = ddply(ttt, .(name), function(x) data.frame(score=weighted.mean(x$returns_rate, x$weight)))
d1
  name score
1    A  0.07
2    B  0.20
3    C  0.17
4    D  0.18
5    E  0.53
6    F  0.20

Here are the weights:

Count <= 100 = 10%
Count >= 101 & Count <= 200 = 20%
Count >= 201 & Count <= 300 = 30%
Count >= 301 = 40%
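
If it helps, here's how that weight column could be derived from count in R (a sketch; the breaks and bucket_weights vectors are just my encoding of the rule above):

# Bucket the counts into the four ranges above and map each bucket to its weight
breaks = c(-Inf, 100, 200, 300, Inf)
bucket_weights = c(0.10, 0.20, 0.30, 0.40)
ttt$weight = bucket_weights[cut(ttt$count, breaks, labels = FALSE)]
ttt$weight
# [1] 0.2 0.3 0.4 0.4 0.2 0.1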

Given what I'm trying to achieve, I thought I'd want a weighted average, but I end up with the results above. This can't be right, though, as returns_rate is equal to score.

How can I go about getting the rate while accounting for the size of count?

Thanks! I must've fallen asleep in basic math on this one.

EDIT:
In light of the comments:

Eventually, the plan is to combine the calls_bad rate (calls_bad/count) and the returns rate (returns/count) into one 'score'/'metric' for each name. However, because count varies significantly by name, I'm working under the assumption that I need to 'weight' the data in order to account for the impact of small counts.

Basically, I want to find (returns/count) and (calls_bad/count) and combine these values into one value, while accounting for the fact that someone with a count of 50 could influence the data in a bad way, which is why I'm thinking of using weights.

EDIT 2:

ttt = data.frame(name=c("A","B","C","D","E","F"),
                 count=c(150,250,350,550,150,50),
                 returns=c(10,50,60,100,80,10),
                 calls_bad=c(5,30,20,15,15,20),
                 weight=c(0.20,0.30,0.40,0.40,0.20,0.10))

ttt$returns_rate = round(ttt$returns/ttt$count, 2)
ttt$calls_bad_rate = round(ttt$calls_bad/ttt$count, 2)

ttt

There are only two numbers to "combine" or "average": 'returns rate' and 'calls bad rate'.

ttt$combined = round((ttt$returns_rate + ttt$calls_bad_rate) / 2, 2)

But given how count varies "significantly" by name, I thought weights based on the size of count were appropriate.

Best Answer

I was writing my own weighted average algorithm yesterday, and it applies to what you're looking to do. Your problem is what you're calling returns_rate and the logic behind it: grouping by name gives ddply one row per group, so the weighted mean of a single value is just that value back again. For these examples we have a new field called weighted_score, or you could call it weighted_returns if you wanted. The point is that it's weighted.

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A      150      10       0.2         2
2     B      250      50       0.3         15
3     C      350      60       0.4         24
4     D      550      100      0.4         40
5     E      150      80       0.2         16
6     F      50       10       0.1         1
-----------------------------------------------------
sums                           1.6         98
weighted_average                           61.3

Logic:

weighted_score = returns*weight
weighted_average = sum(weighted_scores)/sum(weights)
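
In R, a minimal sketch against your ttt data frame (for what it's worth, weighted.mean(ttt$returns, ttt$weight) computes the same thing):

# Weight each name's returns, then divide total weighted returns by total weight
ttt$weighted_score = ttt$returns * ttt$weight
weighted_average = sum(ttt$weighted_score) / sum(ttt$weight)
weighted_average
# [1] 61.25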

I see that you have some logic to determine the weight; however, your upper limit is 40%. The weights really should span 0% through 100%. You can also calculate the weight on the fly for any given data set: find the highest count and divide the count in each row by that highest number, which gives you the appropriate 0% to 100% weights.
Example:

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A     150      10       0.2727         2.7
2     B     250      50       0.4545         22.7
3     C     350      60       0.6364         38.2
4     D     550      100      1.0000         100
5     E     150      80       0.2727         21.8
6     F     50       10       0.0909         0.9
-----------------------------------------------------
sums                          2.7273         186.3636
weighted_average                             68.3

Logic:

weight = count/550
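
In R, a sketch of the same idea (max(ttt$count) is 550 here, so nothing is hard-coded):

# Divide each count by the largest count to get 0-1 weights on the fly
ttt$weight = ttt$count / max(ttt$count)
ttt$weighted_score = ttt$returns * ttt$weight
sum(ttt$weighted_score) / sum(ttt$weight)
# [1] 68.33333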

Another, simpler option that doesn't require as much processing to figure out the appropriate weight is to create a static variable to use as your ceiling. Any count lower than the ceiling is weighted with lower importance, and anything at or above the ceiling is weighted at 100%.

id | name | count | returns | weight | weighted_score
-----------------------------------------------------
1     A     150      10       0.4983         5.0
2     B     250      50       0.8306         41.5
3     C     350      60       1.0000         60.0
4     D     550      100      1.0000         100
5     E     150      80       0.4983         39.9
6     F     50       10       0.1661         1.7
-----------------------------------------------------
sums                          3.9934         248.0399
weighted_average                             62.1

Logic:

if(count < 301) weight = count/301
if(count >= 301) weight = 1
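
In R, a sketch of the ceiling approach (pmin() caps the weight at 1, which covers both ifs above, including the count == 301 boundary):

# Counts below the ceiling get proportional weight; anything at or above it gets 100%
ceiling_count = 301
ttt$weight = pmin(ttt$count / ceiling_count, 1)
ttt$weighted_score = ttt$returns * ttt$weight
sum(ttt$weighted_score) / sum(ttt$weight)
# [1] 62.11314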