Solved – Is R’s trimmed means function biased

algorithmsbugsrtrimmed-mean

I've been trying to understand how R's trimmed mean function works. I suspect it might be biased, but would like to get feedback here before I file a bug-report (if this is an inappropriate forum for such a question please let me know).

Consider the following example (sorted for clarity):

> x <- c(1, 2, 3, 45, 55, 56, 57, 58, 403, 900)
> length(x)
[1] 10
> percent.to.trim <- 0.25

One would expect that the trimmed mean would be based on length(x) - 2 * length(x) * percent.to.trim = 5 observations. If this is the case, there are three possible methods of calculation:

  1. To remove the two smallest observations and the three largest observations. In this case, mean(x[3:7]) = 43.2.
  2. To remove the three smallest observations and the two largest observations. In this case, mean(x[4:8]) = 54.2.
  3. To remove the two smallest observations, the two largest observations and to average the third smallest and third largest observation (this is to be preferred for obvious reasons). In this case mean(c(x[4:7], mean(c(x[3], x[8])))) = 48.7.

However, R gives the following result:

> mean(x, trim=percent.to.trim)
[1] 45.66667

This must result from the trimmed mean being based on length(x) - 2 * floor(length(x) * percent.to.trim) = 6 observations, thus removing only the first two and last two observations:

> mean(x[3:8])
[1] 45.66667

Is this a bug?

I'm using R version 2.13.1 (2011-07-08), svn rev 56322.

Best Answer

In such cases it is a good idea to look at the source code. In this case the source code of mean.default

> mean.default
function (x, trim = 0, na.rm = FALSE, ...) 
{
    if (!is.numeric(x) && !is.complex(x) && !is.logical(x)) {
        warning("argument is not numeric or logical: returning NA")
        return(NA_real_)
    }
    if (na.rm) 
        x <- x[!is.na(x)]
    if (!is.numeric(trim) || length(trim) != 1L) 
        stop("'trim' must be numeric of length one")
    n <- length(x)
    if (trim > 0 && n) {
        if (is.complex(x)) 
            stop("trimmed means are not defined for complex data")
        if (any(is.na(x))) 
            return(NA_real_)
        if (trim >= 0.5) 
            return(stats::median(x, na.rm = FALSE))
        lo <- floor(n * trim) + 1
        hi <- n + 1 - lo
        x <- sort.int(x, partial = unique(c(lo, hi)))[lo:hi]
    }
    .Internal(mean(x))
}

From this the trimming is based on the lower and upper index bounds lo <- floor(n * trim) + 1 and hi <- n + 1 - lo, which are the design decisions implemented in R. Now, while writing this I see that Nick Sabbe posted more or less the same answer, but I leave the entire code posted here as well.

P.S. I think the clue to reading the help page is that it says a symmetrically trimmed mean, not the symmetrically trimmed mean.

Related Question