Kernel density estimation of the log-normal distribution

kernel-smoothing, lognormal distribution

I stumbled upon the following issue I cannot make sense of:

With the default settings, the KDE for a log-normal sample (green) does not look like a density that integrates to 1; compare it with the true density (violet):

[Figure: default KDE of the log-normal sample (green) against the true log-normal density (violet)]

I created this using

set.seed(1)
n <- 1e6
xax <- seq(-3,20,by=.1)

x <- rlnorm(n)
plot(density(x),lwd=3,col="seagreen",xlim=c(-3,20))
lines(xax,dlnorm(xax),lwd=3,col="palevioletred1")

The approach itself does not seem obviously wrong, because it produces decent results for a sample from a $\chi^2$-distribution and excellent results for a normal population.

[Figure: default KDE of the $\chi^2$ sample (green) against the true density (violet)]

x <- rchisq(n,2)
plot(density(x),lwd=3,col="seagreen",xlim=c(-3,20))
lines(xax,dchisq(xax,2),lwd=3,col="palevioletred1")

[Figure: default KDE of the normal sample (green) against the true density (violet)]

x <- rnorm(n,10)
plot(density(x),lwd=3,col="seagreen",xlim=c(-3,20))
lines(xax,dnorm(xax,10),lwd=3,col="palevioletred1")

Best Answer

One thing that concerns me is that your bandwidth is only about 1/3 of the distance between the points at which your density estimate is evaluated.

> diff(density(x)$x[1:10])
[1] 0.2055696 0.2055696 0.2055696 0.2055696 0.2055696 0.2055696 0.2055696
[8] 0.2055696 0.2055696

vs

> density(x)$bw
[1] 0.06165546

This can potentially lead to odd results.

Indeed, that seems to be most of the problem.
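If you want to check this on your own sample, a minimal sketch along the following lines computes the ratio directly. It assumes the sample is regenerated as in the question (`set.seed(1)`, `rlnorm(1e6)`); the names `d`, `grid_gap` and `ratio` are only illustrative.

```r
# Assumption: same seed and sample size as in the question.
set.seed(1)
x <- rlnorm(1e6)

d <- density(x)               # defaults: Gaussian kernel, n = 512 evaluation points
grid_gap <- diff(d$x[1:2])    # spacing between neighbouring evaluation points
ratio <- d$bw / grid_gap      # bandwidth measured in units of the grid spacing

c(bw = d$bw, grid_gap = grid_gap, ratio = ratio)
```

A ratio well below 1 means that neighbouring evaluation points are more than one bandwidth apart, so the grid is too coarse to resolve the kernel. The values printed above correspond to a ratio of about 0.3.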

Try density(x,n=2^14) in your code. (Actually, it looks like $2^{12}$ would do, and even $2^{10}$ is a substantial improvement.)

[Figure: KDE with n = 2^14 evaluation points (green) against the true log-normal density (pink)]

You can see the pink here almost entirely obscures the green.
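For reference, a comparison of this kind can be produced with something like the sketch below. It simply reuses the question's setup with the finer grid; `d_fine` is an illustrative name, and the seed and sample size are assumed to match the question.

```r
# Assumption: same seed, sample size and plotting range as in the question.
set.seed(1)
n <- 1e6
x <- rlnorm(n)
xax <- seq(-3, 20, by = 0.1)

# Evaluate the KDE on 2^14 points instead of the default 512.
d_fine <- density(x, n = 2^14)

plot(d_fine, lwd = 3, col = "seagreen", xlim = c(-3, 20))
lines(xax, dlnorm(xax), lwd = 3, col = "palevioletred1")

# The grid spacing is now well below the bandwidth.
c(bw = d_fine$bw, grid_gap = diff(d_fine$x[1:2]))
```

Only the evaluation grid changes here; the bandwidth is still the default one, so the estimate itself is the same kernel smooth, just computed at enough points to show it properly.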

This issue of a small bandwidth relative to the gap between evaluation points* is caused by the very large sample size: because the default bandwidth is proportional to $n^{-\frac15}$, for large enough $n$ this will eventually happen even with Gaussian data.

*[which is the (extended) range divided by the number of intervals between the default 512 evaluation points, i.e. 511]
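To see how the two quantities move with sample size, here is a rough sketch that computes the bandwidth-to-gap ratio for nested subsamples of the log-normal sample. The subsample sizes and the names `sizes` and `ratios` are my own choices for illustration. With the default `bw.nrd0` rule the bandwidth shrinks like $n^{-\frac15}$, while for a long-tailed sample the range (and hence the gap) tends to grow with $n$, so the ratio falls on both counts.

```r
# Assumption: same seed and distribution as in the question.
set.seed(1)
x <- rlnorm(1e6)

sizes <- c(1e2, 1e4, 1e6)   # illustrative subsample sizes

# Bandwidth divided by evaluation-grid spacing, using density() defaults.
ratios <- sapply(sizes, function(m) {
  d <- density(x[seq_len(m)])
  d$bw / diff(d$x[1:2])
})

data.frame(n = sizes, bw_over_gap = ratios)
```

Once the ratio drops below about 1, passing a larger `n` to `density()` is the simple remedy.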
