Solved – Interpreting Kernel density Plot

density-estimation, econometrics, regression

Below I show the kernel density estimate of the size of the informal economy, and would appreciate help interpreting it.
For instance, what does the height of the Kdensity line, around .017, represent relative to the normal density line?

What does a bandwidth of 7.31 tell us?

[Figure: kernel density estimate of the size of the informal economy, with a normal density line for comparison]

Best Answer

Given a random sample from a population, a kernel density estimator (KDE) seeks to estimate the density function of the population distribution. You can read Wikipedia's article on KDEs or various other Internet pages for details of how a KDE is formed. (I have found referenced papers by Silverman to be extraordinarily clear.)

Roughly speaking, one chooses the shape of a 'kernel' density (often normal, sometimes uniform or others), centers one copy of it at each observation, and takes the equally weighted mixture of these distributions as the KDE. The bandwidth sets the spread of each component: the smaller the bandwidth, the narrower the components and the bumpier the resulting estimate. Results are often smoother than you get by trying to estimate a density function using a histogram. You can think of a KDE as a 'smoothed histogram', but the KDE is computed directly from the data, entirely independently of any histogram.
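To make the 'mixture of kernels' idea concrete, here is a minimal sketch that builds a gaussian KDE by hand and compares it with R's 'density'. The simulated data and the names 'grid', 'kde_manual' and 'kde_r' are chosen only for illustration. R's 'density' uses a binned approximation internally, so the two curves should agree closely rather than exactly.

```r
set.seed(1)
x <- rnorm(50)                      # illustrative sample, not the data from the question
h <- bw.nrd0(x)                     # R's default bandwidth rule for density()

# Evaluation grid extending 3 bandwidths beyond the data (density()'s default 'cut')
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 200)

# KDE by hand: at each grid point, average the normal densities
# centered at the observations, each with standard deviation h
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

# The same estimate from density(), evaluated on the same grid
kde_r <- density(x, bw = h, from = min(grid), to = max(grid), n = 200)

max(abs(kde_manual - kde_r$y))      # largest pointwise difference between the two curves
```

Each observation contributes one normal curve of standard deviation 'h'; shrinking 'h' narrows every curve, which is why a small bandwidth gives a bumpier estimate.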

With a larger sample, the KDE will generally come closer to the density function of the population.
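One rough way to see the effect of sample size is to measure the integrated squared error (ISE) between the KDE and a known density. The sketch below assumes a Gamma population with shape 5 and rate 0.1; the helper 'ise_gamma', the sample sizes, and the grid limits are illustrative choices, not part of any standard recipe.

```r
# ISE of the default KDE against the true Gamma density,
# approximated by a Riemann sum on a regular grid
ise_gamma <- function(x, shape = 5, rate = 0.1) {
  d <- density(x, from = 0, to = 250, n = 1024)   # default gaussian kernel and bandwidth
  step <- d$x[2] - d$x[1]
  sum((d$y - dgamma(d$x, shape = shape, rate = rate))^2) * step
}

set.seed(2021)
ise_small <- ise_gamma(rgamma(100,  shape = 5, rate = 0.1))
ise_large <- ise_gamma(rgamma(10^4, shape = 5, rate = 0.1))
c(n_100 = ise_small, n_10000 = ise_large)
```

The ISE of a single sample is itself random, so this is an illustration rather than a proof, but the larger sample should give a much smaller error.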

Suppose you have a sample of size $n = 500$ from $\mathsf{Gamma}(\mathsf{shape}=5,\mathsf{rate}=0.1),$ which has $\mu=50,\sigma^2=500.$

set.seed(2021)
x = rgamma(500, 5, 0.1)
summary(x);  sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   4.33   32.18   44.91   49.11   62.64  163.26 
[1] 23.9333   # sample SD
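As a check on the population values quoted above, write the shape as $\alpha = 5$ and the rate as $\lambda = 0.1$ (symbols introduced here only for illustration). Then

$$\mu = \frac{\alpha}{\lambda} = \frac{5}{0.1} = 50, \qquad \sigma^2 = \frac{\alpha}{\lambda^2} = \frac{5}{0.01} = 500, \qquad \sigma = \sqrt{500} \approx 22.36,$$

so the sample mean 49.11 and sample SD 23.93 are both close to their population counterparts.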

Here is a histogram of the sample, a graph of the density function of $\mathsf{Gamma}(5, .1)$ [dotted black], individual observations [tick marks], and the default KDE from R [solid brown].

hdr = "n = 500: Sample from GAMMA(5,.1) with Density (dotted) and KDE"
hist(x, prob=TRUE, col="skyblue2", breaks=20, main=hdr);  rug(x)
 curve(dgamma(x, 5, .1), add=TRUE, lwd=2, lty="dotted")
 lines(density(x), lwd=2, col="brown")

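[Figure: histogram of the n = 500 sample with tick marks, the Gamma(5, .1) density (dotted black), and the default KDE (solid brown)]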

With obvious changes in the R code, here is a similar plot with $n = 10\,000$ observations. Here we show two KDEs: one with half the default bandwidth (parameter 'adj=.5' in 'density') and one with double the default bandwidth ('adj=2').

set.seed(401)
x = rgamma(10^4, 5, .1)
hdr = "n = 10,000: Sample from GAMMA(5,.1) with KDEs of two bandwidths"
hist(x, prob=TRUE, col="skyblue2", breaks=20, main=hdr)
 curve(dgamma(x, 5, .1), add=TRUE, lwd=2, lty="dotted")
 lines(density(x, adj=.5), lwd=2, col="green3")               # half the default bandwidth
 lines(density(x, adj=2), lwd=2, col="red", lty="dashed")     # double the default bandwidth

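[Figure: histogram of the n = 10,000 sample with the Gamma(5, .1) density (dotted black) and KDEs with half (green) and double (dashed red) the default bandwidth]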

The KDE with the narrower bandwidth (green) is not smooth near the mode; the one with the wider bandwidth (red) is not quite right in the lower tail. Either KDE is better than the histogram with about 20 bins. In my experience, the default bandwidth in R is about right. (Default gaussian kernels are used throughout.)

The plots above do not display the bandwidth R used, although the object returned by 'density' stores it in its 'bw' component. I am happy to consider the bandwidth as a technical matter and have found it more useful to see how well the KDE matches a histogram (or the true density curve, if known).
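For readers who do want the number, the bandwidth can be read from the object that 'density' returns. The sketch below reuses the $n = 500$ Gamma sample from above. The variable names are illustrative, and the hand-written rule is the formula given in R's documentation for 'bw.nrd0', the default bandwidth selector. With a gaussian kernel, the bandwidth is the standard deviation of each kernel component, in the units of the data.

```r
set.seed(2021)
x <- rgamma(500, shape = 5, rate = 0.1)

# Bandwidth actually used by the default KDE
d_default  <- density(x)
bw_default <- d_default$bw

# The default rule written out by hand (see ?bw.nrd0):
# 0.9 * min(SD, IQR / 1.34) * n^(-1/5)
bw_rule <- 0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-1/5)

# 'adjust' (abbreviated 'adj' in the plots above) multiplies the default bandwidth
bw_half <- density(x, adjust = 0.5)$bw

c(default = bw_default, rule = bw_rule, half = bw_half)
```

A reported bandwidth is therefore best read as the amount of smoothing applied on the scale of the variable, not as a feature of the data themselves.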
