Probability Density – How to Calculate Overlap Between Empirical Probability Densities

Tags: density function, kernel-smoothing, probability, r

I'm looking for a method to calculate the area of overlap between two kernel density estimates in R, as a measure of similarity between two samples. To clarify, in the following example, I would need to quantify the area of the purplish overlapping region:

library(ggplot2)
set.seed(1234)
d <- data.frame(variable=c(rep("a", 50), rep("b", 30)), value=c(rnorm(50), runif(30, 0, 3)))
ggplot(d, aes(value, fill=variable)) + geom_density(alpha=.4, color=NA)

[Figure: overlapping density plots of samples a and b, as produced by the code above]

A similar question was discussed here, the difference being that I need to do this for arbitrary empirical data rather than predefined normal distributions. The overlap package addresses this question, but apparently only for timestamp data, which doesn't work for me. The Bray-Curtis index (as implemented in vegan package's vegdist(method="bray") function) also seems relevant but again for somewhat different data.

I'm interested in both the theoretical approach and the R functions I might employ to implement it.

Best Answer

The area of overlap of two kernel density estimates may be approximated to any desired degree of accuracy.

1) The KDEs have probably already been evaluated on some grid. If the grid is the same for both (or they can easily be re-evaluated on a common one), the exercise can be as simple as taking $\min(K_1(x),K_2(x))$ at each grid point, where $K_1$ and $K_2$ denote the two density estimates, and then integrating the result with the trapezoidal rule, or even a midpoint rule.
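A minimal sketch of the common-grid approach in R, using the simulated data from the question. It assumes base R's `density()` with its default Gaussian kernel and `bw.nrd0` bandwidth; the grid limits (three bandwidths beyond the pooled data range) and `n = 1024` are illustrative choices, not requirements.

```r
set.seed(1234)
a <- rnorm(50)
b <- runif(30, 0, 3)

# Shared evaluation range: pooled data range padded by 3 bandwidths (assumed choice)
pad <- 3 * max(bw.nrd0(a), bw.nrd0(b))
lo  <- min(a, b) - pad
hi  <- max(a, b) + pad

# Evaluate both KDEs on the same grid by fixing from, to and n
da <- density(a, from = lo, to = hi, n = 1024)
db <- density(b, from = lo, to = hi, n = 1024)

# Pointwise minimum of the two curves
x <- da$x
y <- pmin(da$y, db$y)

# Trapezoidal rule: sum of interval width times average height
overlap <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
overlap
```

The result lies between 0 (no overlap) and 1 (identical estimates). A finer grid or wider padding tightens the approximation.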

If the two are on different grids and can't easily be recalculated on the same grid, interpolation could be used.
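If the estimates are only available on their own grids, linear interpolation onto a common grid is one option. The sketch below assumes base R's `approx()` and treats each estimate as zero outside the range on which it was evaluated; the 2000-point grid is an arbitrary choice.

```r
set.seed(1234)
a <- rnorm(50)
b <- runif(30, 0, 3)

# Two KDEs evaluated on their own default grids (different ranges and spacing)
da <- density(a)
db <- density(b)

# Common grid covering both evaluation ranges
x <- seq(min(da$x, db$x), max(da$x, db$x), length.out = 2000)

# Linear interpolation onto the common grid; 0 outside each KDE's own grid
fa <- approx(da$x, da$y, xout = x, yleft = 0, yright = 0)$y
fb <- approx(db$x, db$y, xout = x, yleft = 0, yright = 0)$y

# Pointwise minimum, then the trapezoidal rule
y <- pmin(fa, fb)
overlap_interp <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
overlap_interp
```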

2) Alternatively, find the point (or points) where the two KDEs intersect and, between consecutive intersections, integrate whichever curve is lower there. In your diagram above you'd integrate the blue curve to the left of the intersection and the pink one to the right, by whatever means you like or have available. This can be done essentially exactly by summing the area under each kernel component $\frac{1}{h}K\left(\frac{x-x_i}{h}\right)$ (each weighted by $1/n$ in the KDE) to the left or right of the cut-off point; for a Gaussian kernel, the area to the left of a cut-off $c$ is $\Phi\left(\frac{c-x_i}{h}\right)$.
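The intersection approach could be sketched as follows, assuming Gaussian kernels with the `bw.nrd0` bandwidth (the defaults of `density()`). The helpers `kde_pdf`, `kde_cdf` and `dif` are hypothetical names introduced for this illustration. Crossings are bracketed on a grid and refined with `uniroot()`, so a crossing between two adjacent grid points could in principle be missed.

```r
set.seed(1234)
a <- rnorm(50)
b <- runif(30, 0, 3)
ha <- bw.nrd0(a)
hb <- bw.nrd0(b)

# Gaussian KDE and its CDF in closed form: averages of dnorm / pnorm components
kde_pdf <- function(x, obs, h) sapply(x, function(xx) mean(dnorm(xx, mean = obs, sd = h)))
kde_cdf <- function(x, obs, h) sapply(x, function(xx) mean(pnorm(xx, mean = obs, sd = h)))

# Difference of the two estimates; its roots are the intersection points
dif <- function(x) kde_pdf(x, a, ha) - kde_pdf(x, b, hb)

# Bracket sign changes on a grid (4 bandwidths of padding, an assumed choice)
pad  <- 4 * max(ha, hb)
grid <- seq(min(a, b) - pad, max(a, b) + pad, length.out = 2001)
idx  <- which(diff(sign(dif(grid))) != 0)
roots <- vapply(idx, function(i) uniroot(dif, grid[c(i, i + 1)], tol = 1e-10)$root,
                numeric(1))

# Intervals between consecutive crossings, extended to +/- infinity
cuts  <- c(-Inf, roots, Inf)
left  <- head(cuts, -1)
right <- tail(cuts, -1)

# Decide which curve is lower in each interval by probing a finite midpoint
inner   <- c(min(grid), roots, max(grid))
mids    <- head(inner, -1) + diff(inner) / 2
a_lower <- dif(mids) < 0

# Exact mass of each KDE on each interval, taken from the lower curve
mass_a <- kde_cdf(right, a, ha) - kde_cdf(left, a, ha)
mass_b <- kde_cdf(right, b, hb) - kde_cdf(left, b, hb)
overlap_exact <- sum(ifelse(a_lower, mass_a, mass_b))
overlap_exact
```

Given the crossing points, the integration itself involves no grid error, so the only approximation left is the root finding. With a non-Gaussian kernel, `dnorm` and `pnorm` would need to be replaced by that kernel's density and CDF.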

However, whuber's comments above should be clearly borne in mind -- this is not necessarily a very meaningful thing to do.
