Why use a “truncated mean” (aka “trimmed mean”)

descriptive statistics, median

NB: In this post, I will use the abbreviation $TM_P$ to stand for
a symmetric truncated (or trimmed) mean that discards the
largest and smallest $P/2$ percent of the data. In fact, for
concreteness I will refer mostly to $TM_{50}$, but the same question could be asked for any other "similar value" of $P$.
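To make the notation concrete, here is a minimal sketch in R. It assumes that $TM_P$ maps onto R's built-in `mean(x, trim = ...)`, whose `trim` argument is the fraction dropped from each end, so $TM_P$ would correspond to `trim = P/200`. The helper name `tm` and the data are made up for illustration.

```r
# Hypothetical helper: TM_P discards P/2 percent of the data from each tail.
# mean()'s `trim` is the fraction removed from EACH end, hence P / 200.
tm <- function(x, P) mean(x, trim = P / 200)

x <- c(1, 2, 3, 4, 5, 6, 7, 100)  # toy data with one large value

tm(x, 0)    # ordinary mean: all 8 values are used
tm(x, 50)   # TM_50: the 2 smallest and 2 largest values are dropped
median(x)   # for comparison
```

With 8 observations, `tm(x, 50)` averages only the four middle values (3, 4, 5, 6), so the value 100 has no influence on it.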

Question

In what situations is there a sound statistical basis
for preferring $TM_{50}$ (or $TM_P$ for some other, more appropriate
$P$) over the median as a measure of the data's central tendency?

EDIT: (In response to Nick Cox's ridiculing of my wording.) Here's an example of the sort of justification (speaking very broadly) I had hoped for. Q: Why choose the median over the mean? A: For robustness against outliers. The mean can be made anything one wants by a single sufficiently extreme outlier, whereas the median is naturally immune to such an instability; no outlier censoring required. Granted, very extreme outliers are probably entirely tangential to the process under study, but this reasoning still vividly underscores the robustness of the median. There's nothing subjective about it; it's a mathematical fact, even if different people find the same mathematical fact more or less problematic.
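The robustness argument can be checked with a small sketch in R. The data are invented for illustration, and `trim = 0.25` is assumed to correspond to $TM_{50}$: one value is replaced by an absurdly large one, and the three summaries are recomputed.

```r
x     <- c(2, 3, 5, 7, 11, 13, 17)       # made-up clean sample
x_bad <- replace(x, length(x), 1e9)      # same sample, largest value corrupted

# The mean follows the outlier wherever it goes.
c(clean = mean(x), corrupted = mean(x_bad))

# The median ignores it.
c(clean = median(x), corrupted = median(x_bad))

# So does TM_50 (trim = 0.25 drops one value from each end when n = 7).
c(clean = mean(x, trim = 0.25), corrupted = mean(x_bad, trim = 0.25))
```

In this toy case a single outlier moves neither the median nor $TM_{50}$, which is the sense in which the robustness argument alone does not separate the two.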

Background

I cannot think of a justification for using $TM_{50}$ that does
not apply equally well to the median ¹. Furthermore, I
figure that, as summary statistics go, the median is the conceptually and analytically simpler of the two (and, therefore, the more thoroughly studied and better characterized one). Thus, my gut reaction is to always prefer the median over $TM_{50}$. The motivation for this thread is to either confirm or refute this gut reaction.

¹ Of course, the median can be thought of as, roughly, the limit of $TM_P$ as $P \to 100$. Therefore, I figure that any justification for using $TM_P$ over the median can only get weaker as $P \to 100$. An analogous consideration applies when $P \to 0$, if we also replace the median with the mean.
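A rough numerical illustration of the footnote, assuming again that $TM_P$ corresponds to R's `mean(x, trim = P/200)`; the sample, the seed, and the grid of $P$ values are arbitrary choices made here:

```r
set.seed(1)
x <- rexp(101)                       # arbitrary skewed sample, odd n

P <- c(0, 25, 50, 75, 90, 99, 100)   # percent of data discarded in total
tm_values <- sapply(P, function(p) mean(x, trim = p / 200))

# One row per P: the first row is the plain mean, the last is the median
# (R's mean() returns the median once trim reaches 0.5).
data.frame(P = P, TM_P = tm_values)

c(mean = mean(x), median = median(x))
```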

Best Answer

If you are comparing, say, the sample median, the sample interquartile mean (trimmed mean with 25% of the data removed from each of the highest and lowest values) and the sample mean, you have to say what you are trying to do. Otherwise asking which one is better makes no sense at all. If your goal is to estimate the 'center' of a population, you will have to face the problem that these are estimators of DIFFERENT POPULATION PARAMETERS. In this sense, they are not really comparable. For instance, with life expectancies, do you want to estimate the life expectancy that is attained by half of the people? Then you want a median. If you want the average population life expectancy, you'll estimate that with a sample mean. If you want the average life expectancy of the middle 50% of the population, you'll want to estimate it with the interquartile mean. These are not the same value if the population isn't symmetric.
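To see that these really are different targets for an asymmetric population, here is a minimal sketch using a standard exponential distribution as a stand-in (my choice of example, not data on life expectancy). Its population mean is 1 and its population median is $\log 2 \approx 0.693$; a large sample is used so that each statistic sits close to its own parameter.

```r
set.seed(2024)
x <- rexp(1e5)   # large sample from a skewed population (rate = 1)

estimates <- c(
  mean                 = mean(x),               # targets the population mean (1)
  "interquartile mean" = mean(x, trim = 0.25),  # targets the mean of the middle 50%
  median               = median(x)              # targets the population median (log 2)
)
estimates
```

The three numbers settle at three different values as the sample grows, so no amount of data makes them agree.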

With symmetric populations, all of these are, in some sense, estimators of the same parameter (I say "in some sense" because the sample mean isn't a consistent estimator of the population center for some really heavy-tailed distributions, such as the Cauchy). You'd want to pick the one that minimizes some reasonable loss function (variance?) for the distributions you are likely to work with. Under this criterion, we CAN compare the three.

In general, the mean uses more of the information in the data than does the trimmed mean, which uses more than the median. On the other hand, the median is extremely robust to errors in the data, the trimmed mean somewhat less so and the mean very susceptible to being ruined by outliers.

If your goal is to look for a shift in location for your distribution (for instance, does the life expectancy increase by a fixed amount in one treatment group), you can base comparisons between treatment groups on any measure of center, even if the distribution isn't symmetric. Then you'd select the one that has the lowest variance.
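A small sketch of the location-shift idea, under assumptions chosen here for illustration: both groups come from the same skewed (exponential) distribution, and the treated group is shifted by a fixed amount of 2. The group sizes and seed are arbitrary.

```r
set.seed(42)
n     <- 5000
shift <- 2                       # assumed true treatment effect

control <- rexp(n)               # skewed baseline distribution
treated <- rexp(n) + shift       # same shape, shifted by a constant

# Each measure of center differs between groups by (roughly) the same shift,
# even though the three measures disagree within a group.
shift_estimates <- c(
  mean           = mean(treated) - mean(control),
  "trimmed mean" = mean(treated, trim = 0.25) - mean(control, trim = 0.25),
  median         = median(treated) - median(control)
)
shift_estimates
```

All three differences estimate the same shift, which is why the choice among them can then be made on variance alone.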

It's pretty easy to compare these via simulations (here I'm using the median, the mean and the trimmed mean with 40% trimmed from each end). Note that these are all symmetric distributions:

> sims = matrix(ncol=3,nrow=100000,NA)
> colnames(sims) = c("mean","trimmed mean","median")
>  for(i in 1:100000){
+   x = rcauchy(20)
+   sims[i,1] = mean(x)
+   sims[i,2] = mean(x,trim=0.4)
+   sims[i,3] = median(x)
+ }
> #  Variances
> diag(var(sims))
        mean trimmed mean       median 
1.865621e+06 1.360353e-01 1.397239e-01 
> 


>  for(i in 1:100000){
+   x = rnorm(20)
+   sims[i,1] = mean(x)
+   sims[i,2] = mean(x,trim=0.4)
+   sims[i,3] = median(x)
+ }
> #  Variances
> diag(var(sims))
        mean trimmed mean       median 
  0.05007671   0.06883245   0.07347544 
> 


>  for(i in 1:100000){
+   x = rcauchy(20)^3
+   sims[i,1] = mean(x)
+   sims[i,2] = mean(x,trim=0.4)
+   sims[i,3] = median(x)
+ }
> #  Variances
> diag(var(sims))
        mean trimmed mean       median 
2.023236e+27 2.045000e-01 1.308861e-01 
>

In these simulations, for the normal distribution the mean has a lower variance than the others. For the Cauchy, the trimmed mean is best (the variance of the mean is actually infinite!), while for the cubed Cauchy, the median beats both of them.
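For anyone who wants to rerun the comparison on other distributions, the loops can be wrapped in a small function. This is a sketch, not the code that produced the numbers quoted in this answer: the function name `compare_estimators`, the seed, and the reduced number of replications are my own choices.

```r
# Hypothetical helper: simulate `reps` samples of size `n` from `rdist`
# and return the variance of each estimator across those samples.
compare_estimators <- function(rdist, n = 20, reps = 20000, trim = 0.4) {
  sims <- replicate(reps, {
    x <- rdist(n)
    c(mean           = mean(x),
      "trimmed mean" = mean(x, trim = trim),
      median         = median(x))
  })
  apply(sims, 1, var)   # sims is 3 x reps; one variance per estimator
}

set.seed(123)
v_norm   <- compare_estimators(rnorm)
v_cauchy <- compare_estimators(rcauchy)
v_cubed  <- compare_estimators(function(n) rcauchy(n)^3)

rbind(normal = v_norm, cauchy = v_cauchy, "cauchy^3" = v_cubed)
```

Because the Cauchy mean has no finite variance, its entry in the table will jump around wildly from one seed to the next; only the trimmed-mean and median columns are stable there.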
