Solved – How to draw mean, median and mode lines in R that end at density

Tags: data visualization, mean, median, mode, r

I have drawn a skewed distribution, using these commands:

x <- seq(-2.5, 10, length=1000000)
hx5 <- rnorm(x,0,1) + rexp(x,1/5) # tau=5 (rate = 1/tau)
plot(density(hx5), xlim=c(-2.5,10), type="l", col="green",
     xlab="x", main="ExGaussian  curve",lwd=2)

Now I want to draw three lines, for the mean, the mode and the median of the distribution. If I simply write, for example:

abline(v=median(hx5))

the line extends beyond the curve, but I want it to stop at the density curve. So my problem is:

How can I find the value of the density at the mean, mode and median of my observations, so that I can set the correct coordinates for drawing?

Best Answer

The density is represented as a polyline, which is a pair of parallel arrays, one for $x$, one for $y$, forming vertices along the graph of the density (with equal spacings in the $x$ direction). As such it is a discrete approximation to the idealized continuous density and we can use discrete versions of the relevant integrals to compute statistics. Because the spacing is typically so close, there's probably little need to interpolate between successive points: we can use simple algorithms.

Whence,

x <- seq(-2.5, 10, length=1000000)
hx5 <- rnorm(x,0,1) + rexp(x,1/5) # tau=5 (rate = 1/tau)
#
# Compute the density.
#
dens <- density(hx5)
#
# Compute some measures of location.
#
n <- length(dens$y)
dx <- mean(diff(dens$x))                  # Typical spacing in x
y.unit <- sum(dens$y) * dx                # Check: this should integrate to 1
dx <- dx / y.unit                         # Make a minor adjustment
x.mean <- sum(dens$y * dens$x) * dx       # Discrete version of the integral of x*f(x)
y.mean <- dens$y[length(dens$x[dens$x < x.mean])] # Height at the last vertex below the mean
x.mode <- dens$x[i.mode <- which.max(dens$y)]     # Vertex where the curve peaks
y.mode <- dens$y[i.mode]
y.cs <- cumsum(dens$y)                    # Running total of the heights
x.med <- dens$x[i.med <- length(y.cs[2*y.cs <= y.cs[n]])] # Last vertex with at most half the total
y.med <- dens$y[i.med]
#
# Plot the density and the statistics.
#
plot(dens, xlim=c(-2.5,10), type="l", col="green",
     xlab="x", main="ExGaussian curve",lwd=2)
temp <- mapply(function(x,y,c) lines(c(x,x), c(0,y), lwd=2, col=c), 
               c(x.mean, x.med, x.mode), 
               c(y.mean, y.med, y.mode), 
               c("Blue", "Gray", "Red"))

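[Plot: the density curve with vertical lines at the mean (blue), median (gray) and mode (red), each ending at the curve]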
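As a variation on the code above (a sketch, not the answerer's method), the heights can also be read off the polyline by linear interpolation with approx(), taking the mean and median directly from the sample rather than from the discretized density. The sample size of 1e5 and the names x.stat and y.stat are assumptions made here only to keep the example quick and self-contained.

```r
set.seed(1)
# Assumption: a smaller sample than in the question, to keep the sketch fast.
hx5 <- rnorm(1e5, 0, 1) + rexp(1e5, 1/5)   # tau = 5 (rate = 1/tau)
dens <- density(hx5)

# Locations: mean and median from the sample, mode from the peak of the curve.
x.stat <- c(mean   = mean(hx5),
            median = median(hx5),
            mode   = dens$x[which.max(dens$y)])

# Heights: interpolate linearly between the vertices of the polyline.
y.stat <- approx(dens$x, dens$y, xout = x.stat)$y

plot(dens, xlim = c(-2.5, 10), col = "green", lwd = 2,
     xlab = "x", main = "ExGaussian curve")
# One vertical segment per statistic, from the axis up to the curve.
segments(x.stat, 0, x.stat, y.stat, lwd = 2, col = c("blue", "gray", "red"))
```

Because the sample mean and median do not depend on the kernel bandwidth, they may differ slightly from the values computed from the density itself; only the mode is tied to the estimated curve here.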

Related Solutions

Solved – Why are statistical properties of mode and median difficult to determine

For 'statistical properties' read 'computing the distribution of', or 'computing some aspect of the distribution of' (such as variance, say).

In particular, they're probably referring to the sampling distribution of a sample statistic.

Sample means have some quite nice properties*, so that in many cases, for example, it's relatively easy to compute the mean, variance (and covariances) of the distribution of sample means, and asymptotically, we have the central limit theorem which tells us about distributions of means in large samples.

* Means of sums "add", and variances of sums of independent variables also "add": the mean of a sum is the sum of the means, and the variance of a sum of independent variables is the sum of the variances. This usually makes the mean and variance of the distribution of sample means quite easy to find.

By contrast, sample medians (and other quantiles) are often more difficult to work with, and don't have nice linear properties like that. Nevertheless, sometimes we can make progress in finite samples, and asymptotically (i.e. in very large samples) they tend to behave relatively more 'nicely'.

Modes are much worse. Generally speaking, they really don't have very 'nice' properties; for example, it's relatively easy for modes to 'jump about' in somewhat surprising ways when you take averages, and even asymptotically the variance of a mode doesn't decrease as $1/n$.
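A small simulation can make the comparison concrete. This is only a sketch under stated assumptions: the data are standard normal, the mode is estimated as the peak of a kernel density estimate (kde_mode is a hypothetical helper, and other mode estimators would behave differently), and sim_var, sizes and reps are names chosen here for illustration.

```r
set.seed(1)

# Hypothetical helper: estimate the mode as the peak of a kernel density estimate.
kde_mode <- function(z) {
  d <- density(z)
  d$x[which.max(d$y)]
}

# Variance of each statistic across `reps` simulated samples of size n.
sim_var <- function(n, reps = 500) {
  draws <- replicate(reps, {
    z <- rnorm(n)
    c(mean = mean(z), median = median(z), mode = kde_mode(z))
  })
  apply(draws, 1, var)
}

sizes <- c(100, 400, 1600)
res <- sapply(sizes, sim_var)
colnames(res) <- sizes
print(signif(res, 3))   # rows: statistic, columns: sample size
```

For standard normal data the "mean" row should sit near 1/n, since the variance of a sample mean is the population variance divided by n. How quickly the other rows shrink as n quadruples can then be compared against it.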

Solved – Median and mode

Your question raises small questions of terminology and more interesting questions of how to think about data. I will stick with your question and focus on discrete variables. Most of what I say carries over to continuous variables, but with some need for re-wording and/or some differences in procedures.

First off, the mode is usually introduced and defined as just the most common value, namely the value with the highest frequency. That is the strictest sense of the term mode. When data are discrete, we look at frequencies and identify the value with the highest frequency.

When there are ties for the highest frequency, there are also ties for the mode. With these invented data

value       frequency 
 0            1
 1           42 
 2            1
 3            1
 4           42 
 5            1 
 6            1

there is clearly a tie, with two modes at 1 and 4. However, had the frequencies been 42 and 41 it's my guess that most experienced users of statistics would still say that there were two modes, regardless of the rule that the mode is the value with the highest frequency. So, it is also true that a mode is a value with a pronounced peak in a frequency distribution, i.e. with a frequency notably higher than neighbouring values. (It's possible and common for a mode to be either the minimum or the maximum.)
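The two senses of "mode" can be sketched in R with the invented table above. The local-peak rule below is only one possible reading of "higher than neighbouring values"; it says nothing about how pronounced a peak is, and the names strict_modes and local_peaks are introduced here for illustration.

```r
value <- 0:6
freq  <- c(1, 42, 1, 1, 42, 1, 1)

# Strict sense: every value that attains the highest frequency.
strict_modes <- function(value, freq) value[freq == max(freq)]

# Looser sense: values whose frequency exceeds both neighbours.
# Padding with -Inf lets the minimum or maximum value count as a peak.
local_peaks <- function(value, freq) {
  padded <- c(-Inf, freq, -Inf)
  value[freq > head(padded, -2) & freq > tail(padded, -2)]
}

strict_modes(value, freq)   # 1 4
local_peaks(value, freq)    # 1 4

# With frequencies 42 and 41, the strict rule finds one mode, the peak rule two.
freq2 <- c(1, 42, 1, 1, 41, 1, 1)
strict_modes(value, freq2)  # 1
local_peaks(value, freq2)   # 1 4
```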

Don't ask for a precise rule, or even rule of thumb, on what counts as pronounced or notable; it's what is obvious when you graph it and the decision comes quickly with a little experience.

The importance and interest of modes often lies in what they indicate, which is sometimes that there are qualitatively different groups being mixed together in the sample, such as men and women or healthy and sick people. Sometimes there are physical reasons for having two modes. In some climates, there are two common states of cloudiness of the sky, almost cloud-free days and clouded-over days.

I've not seen the term modality being used except occasionally as meaning the number of modes. Statistical people certainly talk about bimodality, meaning that the data are bimodal, or have two modes; or multimodality, meaning that the data are multimodal, or have many modes. Some of these terms are a little unnecessary and arguably relics of a time when there was a stronger inclination among scholars and scientists to invent words based on Latin and Greek roots, but they are quite often used.

The second part of your question I read as asking whether the median should be computed as the average of two modes. I may be missing something here, but I guess you are mixing in a quite different question. The computation of medians has nothing to do with modes at all. It's just the convention that with an even number of values, you should report the median as the mean of the two middle values. That's a convention, but it is taught in introductory statistics courses as a rule to be followed. With grouped data, the principle is still the same. It's quite possible with discrete data that interpolation will cause the median to be reported as a value that is not observable, e.g. 2.5 children. That's not something to worry about.
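The even-count convention is easy to check in R. The counts below are invented for illustration, and the name manual is introduced here.

```r
# Invented data: number of children in 8 families.
children <- c(0, 1, 2, 2, 3, 3, 4, 5)

s <- sort(children)
n <- length(s)                           # 8, an even number of values
manual <- mean(s[c(n / 2, n / 2 + 1)])   # mean of the two middle values, 2 and 3

manual             # 2.5
median(children)   # 2.5, the same convention
```

The result, 2.5 children, is not an observable value, which is exactly the situation described above.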

Back to terminology: I'd assert that modality is not another word for mode. Still less can it be used to refer to any value.

EDIT: I tried to pitch my original answer in a way that should help others apart from the OP. I focused on what seemed the more interesting question of what a mode is, and downplayed a question about the median which seems confused on a point well covered in just about every elementary text. I've not tried to keep pace with repeated edits of the original question with more emphasis on how to compute medians.
