Solved – R Kernel Density Plots – Weird Proportions

kernel-smoothingr

When plotting two kernel desitiy estimates into the same frame

plot(density(all4[,3],kernel="epanechnikov"),col='red',ylim=c(0,0.4),xlim=c(-50,50))
lines(density(all2[,3],kernel="epanechnikov"),col="blue",ylim=c(0,0.4),xlim=c(-50,50))

the result looks like this:

enter image description here

which is quite strange since the area under both plots should be unity. The data related to the red curve has an outlier (its maximum is 702.6).

> summary(all4[,3])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-22.3100  -4.0940  -0.4151  -0.1533   3.4880 702.6000 
> summary(all2[,3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-23.520  -4.041  -0.367  -0.109   3.533  67.780 

After removing the outlier the curves match pretty well

a4 = sort(all4[,3])
a2 = sort(all2[,3])
a4 = a4[6:999995]
a2 = a2[6:999995]
plot(density(a4,kernel="epanechnikov"),col='red',ylim=c(0,0.2),xlim=c(-50,50),lwd=3)
lines(density(a2,kernel="epanechnikov"),col="blue",ylim=c(0,0.2),xlim=c(-50,50))

> summary(a4)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-21.0200  -4.0940  -0.4151  -0.1541   3.4880  39.6700 
> summary(a2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-21.9200  -4.0410  -0.3670  -0.1091   3.5330  32.2900 

enter image description here

I am confused since the outlier should STEAL probability mass from the body of the distribution. But the red curve in the first picture seems to be inflated.
Moreover, since both density curves are on the same scales, the area under both curves should be the same, also in the first plot. Where is my mistake?

EDIT: Following Momos recommendations I integrated over the densities. Indeed, the result for the weird one is far beyond unity!

library(sfsmisc)
dd1 = density(all4[,3],kernel="epanechnikov")
dd2 = density(all2[,3],kernel="epanechnikov")
integrate.xy(dd1$x,dd1$y)
[1] 1.490016
integrate.xy(dd2$x,dd2$y)
[1] 0.9977962

Best Answer

The problem is discretization error.

density divides the range of data by default into n=512 intervals. The outlier causes the intervals to be at least an order of magnitude greater than otherwise. With very large amounts of data (it seems to take somewhere more than $10^5$ for this problem to become really apparent with the default value of n and this order-of-magnitude outlier), the cumulative discretization error takes over.

The general lesson--which is by no means confined to just density estimation nor to R--is that when routine tools are pushed towards their limits, we need to understand how they do their calculations and we should remember to pay attention to all the parameters that control their accuracy.


I haven't reverse-engineered the source code to pin this down definitively, but I will supply some empirical evidence using random data sets.

trial <- function(n, mean=0, sd=5, outlier=700, kernel="epanechnikov", width=512) {
  set.seed(17) # Makes everything reproducible
  x <- c(outlier, rnorm(n, mean, sd))
  #
  # Plot the densities for visual comparison.
  #
  plot(dd1 <- density(x, kernel=kernel, n=width), col="Red", xlim=c(-sd,sd)*5)
  lines(dd2 <- density(x[-1], kernel=kernel, n=width), col="Blue", xlim=c(-sd,sd)*5)
  #
  # Print crude integral estimates (trapezoidal rule) to memorialize the results.
  # 
  print (c(outlier=sum(diff(dd1$x)*(dd1$y[-1] + dd1$y[-length(dd1$y)])/2),
         non.outlier=sum(diff(dd2$x)*(dd2$y[-1] + dd2$y[-length(dd2$y)])/2)))
}
trial(10^6)                       # Looks like the error in the question
trial(10^6, kernel="rectangular") # Has similar results--problem is not the kernel
trial(10^6, width=2^13)           # Both integrals are essentially 1 now
trial(10^3)                       # Agreement is ok
trial(10^3, width=2^6)            # The problem is reproduced
trial(10^3, outlier=10^4)         # The problem is enormous due to the wide range