Solved – Representing the uncertainty of the median in a clean way

Tags: distributions, mean, median

Suppose I want to estimate the fraction of the year that a user in some demographic group is active, and I have n users (effectively, n samples). For each user, I have the fraction of the year they were active.

It turns out that the distribution is lognormal for non-active user types, but the log-transformed variable is negatively skewed for very active user types (since there is an upper bound).

In this case, the median is better than the mean as a "typical" value, but how do I represent the uncertainty of this estimate? If the distribution were symmetric, then $mean \pm stdev$ would be a good way to do it, but in a skewed distribution, should I represent it as $median \pm something$ (maybe IQR/2? std?)? What's the best way to represent it?

Best Answer

In this case, the median is better than the mean as a "typical" value

This is a misleading oversimplification common to many introductory texts. It depends on what you want to find out (that is, on what "typical" means for your particular needs). It can be perfectly reasonable to be interested in the behaviour of the mean of a skewed distribution, and where the mean isn't quite what you need, the median may well be less useful still!

As an example, consider how long I wait for the lights at a particular intersection while travelling to work. This waiting time is quite right-skewed. I am interested in the average wait (for example, in working out my average travel time, or how much time I waste at that intersection in a year). On the other hand, if I am worried about being late to work, neither the mean nor the median is of much direct use; I'm much more interested in the chance that the wait might exceed, say, 10 minutes (in heavy traffic it does sometimes happen).

how do I represent uncertainties of this estimate?

If you're interested in an interval for the population median for a continuous variate, one can be produced without even assuming a distributional form (a nonparametric/distribution-free estimate), based on sample quantiles (or more precisely, on order statistics).

This calculation is discussed in various places, including other questions on site; also see Wikipedia, for example. These small-sample intervals are "exact" in the sense that they have the stated coverage (from the binomial), but the attainable values for the coverage ('confidence level') are discrete. That is, you generally can't form an interval with exactly 95% coverage, but, for example, at n=30 you can choose symmetric intervals that have coverages 99.5%, 98.4%, 95.7% and 90.1%.

This sort of interval is related to sign tests and proportions tests.

As an explicit example, let's say we have n=15 values randomly drawn from a continuously distributed population; one of the available levels is 96.5%, corresponding to the interval from the 4th to the 12th sorted observations (the coverage is $1 - 2P(B \le 3)$ with $B \sim \text{Binomial}(15, 1/2)$). That is to say, under repeated sampling, about 96.5% of intervals from the 4th smallest to the 12th smallest observation would include the population median.
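The order-statistic interval described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a continuous population (no ties at the median); the function names `coverage` and `median_interval` are made up for this example and do not come from any library.

```python
from math import comb


def coverage(n, k):
    """Coverage of the interval [x_(k), x_(n-k+1)] for the population median.

    Assumes a continuous population, so the number of observations
    below the median is Binomial(n, 1/2).
    """
    # P(B <= k - 1): the interval misses if fewer than k observations
    # fall below the median; by symmetry the other tail is the same.
    tail = sum(comb(n, i) for i in range(k)) / 2 ** n
    return 1 - 2 * tail


def median_interval(data, k):
    """Return (lower, upper, coverage) from the k-th smallest and k-th largest values."""
    xs = sorted(data)
    n = len(xs)
    if not 1 <= k <= n // 2:
        raise ValueError("k must satisfy 1 <= k <= n // 2")
    return xs[k - 1], xs[n - k], coverage(n, k)


# coverage(15, 4) is about 0.965, matching the n=15 example in the text.
# For n=30, k = 8, 9, 10, 11 give about 0.995, 0.984, 0.957 and 0.901.
```

In practice one would scan `k` from `n // 2` downwards and take the first value whose coverage is at least the level wanted, accepting that the achieved level will usually be somewhat above the nominal one.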

If you're interested in a standard error of the sample median, there's an asymptotic approximation that relies on knowing the height of the density at the median (in large samples from a continuous distribution the sample median will be asymptotically normal, so this can potentially be used to give large-sample intervals).

$\text{s.e.}(\tilde{x})\approx\frac{1}{2\sqrt{n}\,f(\tilde{\mu})}$

If your sample is also large enough to form a reliable estimate of scale (so that you can in turn get that estimate of the density at the median if you know $f$ up to location and scale) this may be useful.
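As a rough sketch of how the asymptotic formula might be used when $f$ is known up to location and scale, the snippet below assumes the data are lognormal. That assumption is mine, made for illustration (the question reports lognormality only for the less active user types), and `lognormal_median_se` is a hypothetical helper name. For a lognormal with log-scale standard deviation $\sigma$ and median $m$, the density at the median is $1/(m\,\sigma\sqrt{2\pi})$.

```python
import math
import statistics


def lognormal_median_se(data):
    """Approximate standard error of the sample median.

    ASSUMPTION: the data are lognormal (all values > 0). The density at
    the median is then f(m) = 1 / (m * sigma * sqrt(2 * pi)), where sigma
    is the standard deviation of log(data).
    """
    n = len(data)
    logs = [math.log(x) for x in data]
    sigma = statistics.stdev(logs)   # scale estimate on the log scale
    med = statistics.median(data)    # sample median on the original scale
    f_med = 1.0 / (med * sigma * math.sqrt(2.0 * math.pi))
    # Asymptotic formula: s.e. ~ 1 / (2 * sqrt(n) * f(median))
    return 1.0 / (2.0 * math.sqrt(n) * f_med)
```

If the lognormal assumption is doubtful, as it would be for the very active user types with an upper bound, the distribution-free interval based on order statistics avoids having to estimate the density at all.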

should I represent it as $median \pm something$ (maybe IQR/2? std?)? What's the best way to represent it?

What's "best" for your purposes? What, if anything, is known about the distribution?
