Solved – Why is the R density plot a bell curve when all datapoints are 0

Tags: density-function, kernel-smoothing, r

When I graph a density plot in R, and all the numbers are slightly greater than 0, I get essentially a vertical line at x = 0. But when all the numbers are exactly equal to 0, I get some sort of bell curve. Why is that? It seems counterintuitive.

The command I used to plot the curves was

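library(ggplot2)

# One density plot of the error per minute, from minute 0 to minute 60
for (i in 0:60) {
    cur_data <- subset(data, time == i)   # rows recorded at minute i
    p <- ggplot(cur_data, aes(x = error)) +
        geom_density() +
        theme_bw() +
        xlab(paste0("Error distribution (minute ", i, ")")) +
        xlim(0, 1)
    ggsave(...)
}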

At i = 60, cur_data should be entirely populated by the values 0.0.

[Figure: Density plot, entries slightly greater than 0]

[Figure: Density plot, entries all 0]

(Originally posted on Stack Overflow; was told to post here.)

Best Answer

You should explain which intuition this behavior runs counter to; that would make it easier to focus the explanation on it.

A kernel density estimate is the convolution of the sample probability function ($n$ point masses of size $\frac{1}{n}$) and the kernel function (itself, by default, a normal density).

The result in the default case is a mixture of normal (Gaussian) densities, each with center at the data values, each with standard deviation $h$ (the bandwidth of the kernel), and weight $\frac{1}{n}$.
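Written out, with $x_1, \dots, x_n$ denoting the data values and $\phi$ the standard normal density (notation introduced here for illustration, not taken from the question), the default estimate is

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h}\, \phi\!\left(\frac{x - x_i}{h}\right).$$

Each term is a normal density with mean $x_i$ and standard deviation $h$, and each carries weight $\frac{1}{n}$.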

When all the data are coincident, the resulting mixture density is a sum of $n$ weighted densities, all with the same mean and standard deviation ... which is just the kernel itself, centered at that data value.
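A quick numerical check of this, as a sketch only: the sample of ten zeros and the bandwidth of 0.1 are arbitrary illustrative choices, and kde_at is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: evaluate a Gaussian KDE directly as an equal-weight
# mixture of normal densities centred at the data values.
kde_at <- function(grid, data, h) {
  sapply(grid, function(g) mean(dnorm(g, mean = data, sd = h)))
}

data <- rep(0, 10)   # all observations coincide at 0 (illustrative)
h    <- 0.1          # arbitrary bandwidth chosen for the example
grid <- seq(-0.5, 0.5, length.out = 101)

mixture <- kde_at(grid, data, h)
kernel  <- dnorm(grid, mean = 0, sd = h)   # a single kernel centred at 0

# The mixture of n identical components collapses to the kernel itself,
# so the largest pointwise difference should be zero up to rounding.
max_diff <- max(abs(mixture - kernel))
print(max_diff)
```

The helper evaluates the mixture exactly rather than calling density(), which uses a binned approximation, so the comparison isolates the algebraic identity.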

The difference in behavior you see might relate to the trim argument in ggplot2::stat_density. When the range of values is exactly zero, my guess is that it's setting trim to FALSE (or at least something other than TRUE), but when it's even a little larger than 0 it's at the default (TRUE). You'd need to look into the source to double check, but that would be my guess. If that's what's happening, you should be able to modify that behavior.
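Separately from the trim guess above, one thing that can be checked directly is the bandwidth that base R's default rule, bw.nrd0, picks in the two situations. The sketch below assumes that geom_density uses this rule by default (bw = "nrd0"); that has not been verified against the ggplot2 version used in the question, and the sample values are made up for illustration.

```r
set.seed(1)
near_zero <- runif(100, min = 0, max = 1e-3)  # made-up values slightly above 0
all_zero  <- rep(0, 100)                      # every value exactly 0

# Default bandwidth rule used by stats::density()
bw_near <- bw.nrd0(near_zero)
bw_zero <- bw.nrd0(all_zero)

print(c(near_zero = bw_near, all_zero = bw_zero))
```

When the spread of the data is zero, bw.nrd0 cannot use the standard deviation or the IQR; it then tries the absolute value of the first observation and finally falls back to a scale of 1. For 100 zeros that gives a bandwidth of about 0.36, while the near-zero sample gets a very small one. On an axis from 0 to 1, that would look like a broad bell in one case and a near-vertical spike in the other.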

Related Solutions

Solved – Why do a density plot and a rug plot seem to disagree

In the Boston data set from the R package MASS, $369$ of the $506$ observations have a value for tax below 470 and $137$ have a value for tax above 665. In fact 666 is by far the most common value in the data set, appearing $132$ times.

So if the area of the density plot to the left is about twice the area to the right, then that could reasonably be taken as representing the distribution. Visual inspection suggests this might be what is happening.
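The counts quoted above can be reproduced with a few lines of R. This is only a sketch for checking them; the variable names are illustrative.

```r
library(MASS)  # provides the Boston data set

tax <- Boston$tax
n_low  <- sum(tax < 470)   # observations in the left-hand group
n_high <- sum(tax > 665)   # observations in the right-hand group
n_666  <- sum(tax == 666)  # count of the single most common value

print(c(low = n_low, high = n_high, at_666 = n_666))
print(n_low / n_high)      # ratio of the two groups, for comparison with the areas
```

The last line gives the ratio of the group sizes, which is what the ratio of the two areas under the density curve should roughly match.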

A more accurate representation would have the right peak much higher and narrower, and this could be achieved by adjusting the parameters.

Added for comments:

For example with a much narrower bandwidth for the density function and some manual jitter:

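library(MASS)  # provides the Boston data set
# Density estimate with a much narrower bandwidth than the default
plot(density(Boston$tax, bw = 5))
# Rug with manual Gaussian jitter so that repeated values are visible
rug(Boston$tax + rnorm(length(Boston$tax), sd = 5), col = 2, lwd = 3.5)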

you would get something like this

Solved – Why density plot tails are beyond maximum and minimum values

The density function in R uses a Gaussian kernel by default. The algorithm is kernel density estimation (KDE), as also noted in the comments. It works as if we place a Gaussian density over each data point and sum them all to obtain a smooth density curve. The density can extend beyond the data boundaries because the kernel used is positive over the entire real axis. If you change the kernel to rectangular or triangular, the density estimate will reach zero some distance beyond the data, but it still won't respect the data minimum and maximum.

KDE is a powerful non-parametric density estimation method: you don't assume a form for the distribution, so the estimate has no built-in range. The aim is to approximate the underlying distribution, so outside the data range the estimate takes comparatively small density values. The lack of data around these points suggests that the probability of the next samples falling there is low, but not impossible.
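A small sketch of this behavior with base R's density(), using a made-up uniform sample: by default the evaluation grid runs from min(x) - cut * bw to max(x) + cut * bw with cut = 3, and the from and to arguments restrict where the curve is evaluated.

```r
set.seed(42)
x <- runif(200, min = 2, max = 5)   # made-up sample bounded between 2 and 5

d_default <- density(x)                               # Gaussian kernel, cut = 3
d_rect    <- density(x, kernel = "rectangular")       # compactly supported kernel
d_clipped <- density(x, from = min(x), to = max(x))   # evaluate on the data range only

# Compare the data range with the range of each evaluation grid.
print(range(x))
print(range(d_default$x))
print(range(d_rect$x))
print(range(d_clipped$x))
```

Note that from and to only change where the curve is evaluated; they do not renormalise the estimate, so the clipped curve no longer integrates to 1 over the plotted range.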
