Trimmed mean vs median

mean median outliers trimmed-mean types-of-averages

I have a data set of all the calls made to an emergency service, along with the ambulance department's response times. The department has admitted that some of the response times are wrong: in some cases the clock was never started (so the value is 0), and in others it was never stopped (so the value can be extremely high).

I want to estimate the central tendency, and I was wondering whether it is better to use the median or a trimmed mean to limit the influence of the outliers.

Best Answer

Consider what a trimmed mean is: In the prototypical case, you first sort your data in increasing order. Then you count up from the bottom until you reach the trimming percentage and discard those values. For example, a 10% trimmed mean is common; in that case you count up from the lowest value until you've passed 10% of all the data in your set. The values below that mark are set aside. Likewise, you count down from the highest value until you've passed your trimming percentage, and set all values greater than that aside. You are now left with the middle 80%. You take the mean of that, and that is your 10% trimmed mean. (Note that you can trim unequal proportions from the two tails, or only trim one tail, but these approaches are less common and don't seem as applicable to your situation.)
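The procedure just described can be sketched in a few lines of Python. This is only an illustration: the function name `trimmed_mean`, the rounding-down rule for the number of points dropped per tail, and the response times themselves are all assumptions made up for the example, not values from the question.

```python
from statistics import mean

def trimmed_mean(values, proportion=0.10):
    """Mean of the data after discarding `proportion` of the points from each tail."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)                 # step 1: sort in increasing order
    k = int(len(ordered) * proportion)       # points dropped per tail (rounded down; an assumption)
    kept = ordered[k:len(ordered) - k]       # step 2: set aside both tails
    return mean(kept)                        # step 3: average what is left

# Hypothetical response times in minutes: one 0 (clock never started)
# and one very large value (clock never stopped).
response_times = [0, 6, 7, 7, 8, 8, 9, 9, 10, 480]

print(mean(response_times))                  # plain mean, dragged upward by the 480
print(trimmed_mean(response_times, 0.10))    # mean of the middle 8 values
```

If SciPy is available, `scipy.stats.trim_mean` performs a comparable symmetric trim, so a hand-written function like this is mainly useful for seeing the steps.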

Now think of what would happen if you calculated a 50% trimmed mean. The bottom half would be set aside, as would the top half. You would be left with only the single value in the middle (ordinally), or the two middle values if the number of observations is even. You would take the mean of that (with a single value, you would just take that value) as your trimmed mean. Note, however, that that value is the median. In other words, the median is a trimmed mean (it is a 50% trimmed mean). It is just a very aggressive one. It discards everything except the middle of the sorted data, so it remains stable even if nearly half of your values are contaminated. This gives you the ultimate protection against outliers at the expense of the ultimate loss of power / efficiency.
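To make the equivalence concrete, here is a small sketch that trims an increasing number of points from each tail until nothing more can be removed, and checks that the result matches the median. The helper `trim_k` and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

def trim_k(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

data = [0, 6, 7, 7, 8, 9, 9, 10, 480]    # hypothetical response times, n = 9
max_k = (len(data) - 1) // 2              # largest trim that still leaves the middle

# With maximal trimming only the middle value (or middle pair) remains,
# so the trimmed mean coincides with the median.
assert trim_k(data, max_k) == median(data)

for k in range(max_k + 1):
    print(k, trim_k(data, k))             # watch the estimate settle as k grows
```

On data like these, most of the change happens with the first point trimmed from each tail; further trimming moves the estimate very little, which is the sense in which going all the way to the median can be more aggressive than needed.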

My guess is that a median / 50% trimmed mean is much more aggressive than is necessary for your data, and is too wasteful of the information available to you. If you have any sense of the proportion of outliers that exist, I would use that information to set the trimming percentage and use the appropriate trimmed mean. If you don't have any basis for choosing the trimming percentage, you could select one by cross-validation, or use a robust regression analysis with only an intercept.
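As a rough illustration of the last option, an intercept-only robust regression amounts to a robust estimate of location. The sketch below assumes a Huber loss (the answer does not name a particular loss), the conventional tuning constant 1.345, and a scale fixed at the rescaled median absolute deviation; the function name `huber_location` and the data are hypothetical.

```python
from statistics import median

def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging.

    Assumptions of this sketch: Huber loss, tuning constant c = 1.345,
    and a scale held fixed at 1.4826 * MAD.
    """
    mu = median(values)                                   # robust starting point
    mad = median([abs(x - mu) for x in values])
    scale = 1.4826 * mad                                  # MAD rescaled to the normal SD
    if scale == 0:
        return mu                                         # more than half the data are identical
    for _ in range(max_iter):
        # Full weight for points within c * scale of mu, shrinking weight beyond that.
        weights = [1.0 if abs(x - mu) <= c * scale else c * scale / abs(x - mu)
                   for x in values]
        new_mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical response times in minutes, with one 0 and one runaway value.
response_times = [0, 6, 7, 7, 8, 8, 9, 9, 10, 480]
print(huber_location(response_times))
```

Unlike a trimmed mean, this approach does not require picking a trimming percentage up front: extreme values are downweighted according to how far they sit from the current estimate rather than being discarded outright.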
