R Probability KDE – Estimating the Density of a Zero-Inflated Parameter in R

kdeprobabilityr

I have a data set with lots of zeros that looks like this:

set.seed(1)
x <- c(rlnorm(100),rep(0,50))
hist(x,probability=TRUE,breaks = 25)

I would like to draw a line for its density, but the density() function uses a moving window that calculates negative values of x.

lines(density(x), col = 'grey')

There is a density(... from, to) arguments, but these seem to only truncate the calculation, not alter the window so that the density at 0 is consistent with the data as can be seen by the following plot :

lines(density(x, from = 0), col = 'black')

(if the interpolation was changed, I would expect that the black line would have higher density at 0 than the grey line)

Are there alternatives to this function that would provide a better calculation of the density at zero?

enter image description here

Best Answer

The density is infinite at zero because it includes a discrete spike. You need to estimate the spike using the proportion of zeros, and then estimate the positive part of the density assuming it is smooth. KDE will cause problems at the left hand end because it will put some weight on negative values. One useful approach is to transform to logs, estimate the density using KDE, and then transform back. See Wand, Marron & Ruppert (JASA 1991) for a reference.

The following R function will do the transformed density:

logdensity <- function (x, bw = "SJ") 
{
    y <- log(x)
    g <- density(y, bw = bw, n = 1001)
    xgrid <- exp(g$x)
    g$y <- c(0, g$y/xgrid)
    g$x <- c(0, xgrid)
    return(g)
}

Then the following will give the plot you want:

set.seed(1)
x <- c(rlnorm(100),rep(0,50))
hist(x,probability=TRUE,breaks = 25)
fit <- logdensity(x[x>0]) # Only take density of positive part
lines(fit$x,fit$y*mean(x>0),col="red") # Scale density by proportion positive
abline(v=0,col="blue") # Add spike at zero.

enter image description here

Related Question