Mean vs Median – In-depth Properties of Mean and Median

Tags: mean, median, robust, sensitivity-analysis, types-of-averages

Can somebody clearly explain the mathematical logic that links the two statements (a) and (b) below? Suppose we have a set of values (some distribution). Now,

a) Median does not depend on every value [it just depends on one or two middle values];
b) Median is the locus of minimal sum-of-absolute-deviations from it.

And likewise, in contrast,

a) (Arithmetic) mean depends on every value;
b) Mean is the locus of minimal sum-of-squared-deviations from it.

My grasp of it is intuitive so far.

Best Answer

This is two questions: one about how the mean and median minimize loss functions and another about the sensitivities of these estimates to the data. The two questions are connected, as we will see.

Minimizing Loss

A summary (or estimator) of the center of a batch of numbers can be created by letting the summary value change and imagining that each number in the batch exerts a restoring force on that value. When the force never pushes the value away from a number, then arguably any point at which the forces balance is a "center" of the batch.

Quadratic ($L_2$) Loss

For instance, if we were to attach a classical spring (following Hooke's Law) between the summary and each number, the force exerted by each spring would be proportional to its length: the distance between the summary and that number. The springs would pull the summary this way and that, eventually settling to a unique stable location of minimal energy.

I would like to draw attention to a little sleight-of-hand that just occurred: the energy is proportional to the sum of squared distances, and Newtonian mechanics teaches us that force is the rate of change of energy with respect to position. Achieving an equilibrium--minimizing the energy--therefore amounts to balancing the forces: the net rate of change in the energy is zero.

Let's call this the "$L_2$ summary," or "squared loss summary."
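
To make the sleight-of-hand explicit, here is the one-line calculus behind it. Writing the data as $y_1, \ldots, y_n$ and the summary value as $t$, the energy is

$$E_2(t) = \sum_{i=1}^n (y_i - t)^2, \qquad E_2'(t) = -2\sum_{i=1}^n (y_i - t).$$

Setting $E_2'(t) = 0$ (balancing the forces) gives $nt = \sum_i y_i$, so the unique equilibrium is the arithmetic mean $t = \bar{y}$.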

Absolute ($L_1$) Loss

Another summary can be created by supposing the sizes of the restoring forces are constant, regardless of the distances between the value and the data. The forces themselves are not constant, however, because they must always pull the value towards each data point. Thus, when the value is less than the data point the force is directed positively, but when the value is greater than the data point the force is directed negatively. Now the energy is proportional to the sum of the distances between the value and the data. There typically will be an entire region in which the energy is constant and the net force is zero. Any value in this region we might call the "$L_1$ summary" or "absolute loss summary."

These physical analogies provide useful intuition about the two summaries. For instance, what happens to the summary if we move one of the data points? In the $L_2$ case with springs attached, moving one data point either stretches or relaxes its spring. The result is a change in force on the summary, so it must change in response. But in the $L_1$ case, most of the time a change in a data point does nothing to the summary, because the force is locally constant. The only way the force can change is for the data point to move across the summary.

(In fact, it should be evident that the net force on a value is given by the number of points greater than it--which pull it upwards--minus the number of points less than it--which pull it downwards. Thus, the $L_1$ summary must occur at any location where the number of data values exceeding it exactly equals the number of data values less than it.)
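
The same calculation makes this explicit. Away from the data points, the energy $E_1(t) = \sum_{i=1}^n |y_i - t|$ has derivative

$$E_1'(t) = \#\{i : y_i < t\} - \#\{i : y_i > t\},$$

which vanishes exactly when as many data values lie above $t$ as below it: the defining property of a median.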

Depicting Losses

Since both forces and energies add, in either case we can decompose the net energy into individual contributions from the data points. Graphing the energy or the force as a function of the summary value gives a detailed picture of what is happening. The summary will be a location at which the energy (or "loss" in statistical parlance) is smallest. Equivalently, it will be a location at which forces balance: the center of the data occurs where the net change in loss is zero.

This figure shows energies and forces for a small dataset of six values (marked by faint vertical lines in each plot). The dashed black curves are the totals of the colored curves showing the contributions from the individual values. The x-axis indicates possible values of the summary.

[Figure 1: Squared and absolute losses (top row) and their changes, i.e. forces (bottom row), as functions of the summary value.]

The arithmetic mean is a point where squared loss is minimized: it will be located at the vertex (bottom) of the black parabola in the upper left plot. It is always unique. The median is a point where absolute loss is minimized. As noted above, it must occur in the middle of the data. It is not necessarily unique. It will be located at the bottom of the broken black curve in the upper right. (The bottom actually consists of a short flat section between $-0.23$ and $-0.17$; any value in this interval is a median.)
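
Both locations are easy to confirm numerically. Here is a small check using the rounded data values quoted in the next section, with base R's optimize performing the one-dimensional minimization; it recovers the mean (to numerical tolerance) and lands somewhere inside the flat median interval.

# Verify the minimizers numerically (rounded data from the figures).
y <- c(-1.02, -0.82, -0.23, -0.17, -0.08, 0.77)
optimize(function(t) sum((y - t)^2), range(y))$minimum  # approximately mean(y)
mean(y)                                                 # -0.2583...
optimize(function(t) sum(abs(y - t)), range(y))$minimum # a point in [-0.23, -0.17]
median(y)                                               # -0.2, the middle of that interval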

Analyzing Sensitivity

Earlier I described what can happen to the summary when a data point is varied. It is instructive to plot how the summary changes in response to changing any single data point. (These plots are essentially the empirical influence functions. They differ from the usual definition in that they show the actual values of the estimates rather than how much those values are changed.) The value of the summary is labeled by "Estimate" on the y-axes to remind us that this summary is estimating where the middle of the dataset lies. The new (changed) values of each data point are shown on their x-axes.

[Figure 2: Sensitivity of the mean (blue) and the median (red) to varying each of the six data values in turn.]

This figure presents the results of varying each of the data values in the batch $-1.02, -0.82, -0.23, -0.17, -0.08, 0.77$ (the same one analyzed in the first figure). There is one plot for each data value, which is highlighted on its plot with a long black tick along the bottom axis. (The remaining data values are shown with short gray ticks.) The blue curve traces the $L_2$ summary--the arithmetic mean--and the red curve traces the $L_1$ summary--the median. (Since often the median is a range of values, the convention of plotting the middle of that range is followed here.)

Notice:

  1. The sensitivity of the mean is unbounded: those blue lines extend infinitely far up and down. The sensitivity of the median is bounded: there are upper and lower limits to the red curves.

  2. Where the median does change, though, it changes much more rapidly than the mean. The slope of each blue line is $1/6$ (generally it is $1/n$ for a dataset with $n$ values), whereas the slopes of the tilted parts of the red lines are all $1/2$. (A numerical check of these slopes appears just after this list.)

  3. The mean is sensitive to every data point and this sensitivity has no bounds (as the nonzero slopes of all the colored lines in the bottom left plot of the first figure indicate). Although the median is sensitive to every data point, the sensitivity is bounded (which is why the colored curves in the bottom right plot of the first figure are located within a narrow vertical range around zero). These, of course, are merely visual reiterations of the basic loss law: quadratic for the mean, linear for the median (equivalently, a force that grows linearly with distance versus one of constant magnitude).

  4. The interval over which the median can be made to change can vary among the data points. It is always bounded by two of the near-middle values among the data which are not varying. (These boundaries are marked by faint vertical dashed lines.)

  5. Because the rate of change of the median, where it changes at all, is always $1/2$, the amount by which it can vary is determined by the length of this gap between near-middle values of the dataset.
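
Here is the numerical check of the slopes promised in point 2, a small sketch using the rounded data values; nudging the fourth value keeps it strictly between its neighbors, so the median responds.

# Nudge one near-middle value and measure the response of each summary.
y <- c(-1.02, -0.82, -0.23, -0.17, -0.08, 0.77)
h <- 0.01
y.new <- y; y.new[4] <- y[4] + h
(mean(y.new) - mean(y)) / h     # 1/6, i.e. 1/n
(median(y.new) - median(y)) / h # 1/2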

Although only the first point is commonly noted, all the points are important. In particular,

  • It is definitely false that the "median does not depend on every value." This figure provides a counterexample.

  • Nevertheless, the median does not depend "materially" on every value in the sense that although changing individual values can change the median, the amount of change is limited by the gaps among near-middle values in the dataset. In particular, the amount of change is bounded. We say that the median is a "resistant" summary.

  • Although the mean is not resistant, and will change whenever any data value is changed, the rate of change is relatively small. The larger the dataset, the smaller the rate of change. Equivalently, in order to produce a material change in the mean of a large dataset, at least one value must undergo a relatively large variation. This suggests the non-resistance of the mean is of concern only for (a) small datasets or (b) datasets where one or more values might lie extremely far from the middle of the batch. (The sketch just after these points shows both behaviors.)
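
Both behaviors appear in one more small sketch with the same rounded data: drag a single value arbitrarily far out and compare the two summaries.

# Drag one value far away: the mean follows it, the median barely moves.
y <- c(-1.02, -0.82, -0.23, -0.17, -0.08, 0.77)
y.out <- y; y.out[6] <- 1e6
mean(y.out)   # about 166666.3: sensitivity to the outlier is unbounded
median(y.out) # still -0.2: trapped between the near-middle values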

These remarks--which I hope the figures make evident--reveal a deep connection between the loss function and the sensitivity (or resistance) of the estimator. For more about this, begin with one of the Wikipedia articles on M-estimators and then pursue those ideas as far as you like.
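
As a first taste of that literature, here is a minimal sketch of an M-estimator of location, assuming Huber's loss with an arbitrary illustrative tuning constant k. The loss is quadratic within k of zero and linear beyond, so each value's pull is bounded (median-like) yet varies smoothly (mean-like).

# Huber loss: quadratic near zero, linear in the tails (k chosen for illustration).
huber <- function(t, k=0.5) ifelse(abs(t) <= k, t^2/2, k*(abs(t) - k/2))
y <- c(-1.02, -0.82, -0.23, -0.17, -0.08, 0.77)
optimize(function(t) sum(huber(y - t)), range(y))$minimum
# About -0.325 for these data; each value's influence is capped at k.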


Code

This R code produced the figures and can readily be modified to study any other dataset in the same way: simply replace the randomly-created vector y with any vector of numbers.

#
# Create a small dataset.
#
set.seed(17)
y <- sort(rnorm(6)) # Some data
#
# Study how a statistic varies when the first element of a dataset
# is modified.
#
statistic.vary <- function(t, x, statistic) {
  sapply(t, function(e) statistic(c(e, x[-1])))
}
#
# Prepare for plotting.
#
# Darken each color by the factor x, returning hex color strings.
darken <- function(color, x=0.8) {
  apply(col2rgb(color)/255 * x, 2, function(s) rgb(s[1], s[2], s[3]))
}
colors <- darken(c("Blue", "Red"))
statistics <- c(mean, median); names(statistics) <- c("mean", "median")
x.limits <- range(y) + c(-1, 1)
y.limits <- range(sapply(statistics, 
                         function(f) statistic.vary(x.limits + c(-1,1), c(0,y), f)))
#
# Make the plots.
#
par(mfrow=c(2,3))
for (i in 1:length(y)) {
  #
  # Create a standard, consistent plot region.
  #
  plot(x.limits, y.limits, type="n", 
       xlab=paste("Value of y[", i, "]", sep=""), ylab="Estimate",
       main=paste("Sensitivity to y[", i, "]", sep=""))
  #legend("topleft", legend=names(statistics), col=colors, lwd=1)
  #
  # Mark the limits of the possible medians.
  #
  n <- length(y)/2
  bars <- sort(y[-1])[ceiling(n-1):floor(n+1)] # near-middle non-varying values
  abline(v=range(bars), lty=2, col="Gray")
  rug(y, col="Gray", ticksize=0.05)
  #
  # Show which value is being varied.
  #
  rug(y[1], col="Black", ticksize=0.075, lwd=2)
  #
  # Plot the statistics as the value is varied between x.limits.
  #
  invisible(mapply(function(f,c) 
    curve(statistic.vary(x, y, f), col=c, lwd=2, add=TRUE, n=501),
    statistics, colors))
  y <- c(y[-1], y[1])    # Move the next data value to the front
}
#------------------------------------------------------------------------------#
#
# Study loss functions.
#
# Total loss at trial value t: apply f to the residuals y - t and sum.
# With f = square.d or abs.d this computes the restoring force, which is
# the negative of the derivative of the loss with respect to t.
loss <- function(x, y, f) sapply(x, function(t) sum(f(y-t)))
square <- function(t) t^2
square.d <- function(t) 2*t # derivative of t^2
abs.d <- sign               # derivative of |t|
losses <- c(square, abs, square.d, abs.d)
names(losses) <- c("Squared Loss", "Absolute Loss",
                   "Change in Squared Loss", "Change in Absolute Loss")
loss.types <- c(rep("Loss (energy)", 2), rep("Change in loss (force)", 2))
#
# Prepare for plotting.
#
colors <- darken(rainbow(length(y)))
x.limits <- range(y) + c(-1, 1)/2
#
# Make the plots.
#
par(mfrow=c(2,2))
for (j in 1:length(losses)) {
  f <- losses[[j]]
  y.range <- range(c(0, 1.1*loss(y, y, f)))
  #
  # Plot the loss (or its rate of change).
  #
  curve(loss(x, y, f), from=min(x.limits), to=max(x.limits), 
        n=1001, lty=3,
        ylim=y.range, xlab="Value", ylab=loss.types[j],
        main=names(losses)[j])
  #
  # Draw the x-axis if needed.
  #
  if (sign(prod(y.range))==-1) abline(h=0, col="Gray")
  #
  # Faintly mark the data values.
  #
  abline(v=y, col="#00000010")
  #
  # Plot contributions to the loss (or its rate of change).
  #
  for (i in 1:length(y)) {
    curve(loss(x, y[i], f), add=TRUE, lty=1, col=colors[i], n=1001)
  }
  rug(y, side=3)
}