Solved – Kernel Density estimation function and bandwidth selection

kernel-smoothing, nonparametric

My question is about

1) how to identify the best kernel function to use (for instance Epanechnikov, Gaussian, triangular, etc.) for earnings in the formal and informal sectors using Stata, and

2) how to work out the bandwidth that would give the best estimate.

Best Answer

Check out the webpage: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

1) You can test them separately: keep some of your data aside as validation data, fit the KDE without the validation data, then look at the likelihood of the validation data under the fitted KDE model. The kernel that gives the highest likelihood is probably the best kernel.

2) You can use cross-validation to find the best bandwidth. There is a section "Bandwidth Cross-Validation in Scikit-Learn" in the link above, which shows how to do it in a couple of lines.
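The held-out-likelihood comparison in point 1 can be sketched directly with scikit-learn's `KernelDensity` (a minimal sketch: the bimodal toy data and the fixed bandwidth of 0.3 are assumptions, not part of the original question):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Bimodal toy data standing in for the earnings samples (assumption)
data = np.concatenate([rng.normal(0.1, 0.2, 500),
                       rng.normal(2.9, 0.8, 500)]).reshape(-1, 1)
rng.shuffle(data)

# Hold out 10% as validation data
n_val = len(data) // 10
val_data, train_data = data[:n_val], data[n_val:]

# Fit one KDE per candidate kernel and score the held-out data;
# score() returns the total log-likelihood of the validation set
scores = {}
for kernel in ['gaussian', 'epanechnikov', 'tophat', 'linear']:
    kde = KernelDensity(kernel=kernel, bandwidth=0.3).fit(train_data)
    scores[kernel] = kde.score(val_data)

best_kernel = max(scores, key=scores.get)
print(best_kernel, scores[best_kernel])
```

Note that scikit-learn calls the triangular kernel `'linear'`, and that kernels with bounded support (tophat, Epanechnikov, linear) can return a log-likelihood of minus infinity if a validation point falls outside the support of every training point.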

EDIT: Here is a demonstration of how you could do it (code mostly taken from the link). The code is in Python, which is well suited to this kind of application:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

nsamp = 500

#simulate bimodal data from two normal distributions
data = np.concatenate([np.random.normal(loc=0.1, scale=0.2, size=nsamp),
                       np.random.normal(loc=2.9, scale=0.8, size=nsamp)])
data = np.reshape(data, (2 * nsamp, 1))
np.random.shuffle(data)  #shuffle so both components appear in both splits

#separate into validation (10%) and training (90%)
n_val = int(2 * nsamp * 0.1)
val_data = data[:n_val]
train_data = data[n_val:]

#look at the training data with a histogram
plt.hist(train_data, bins=100, density=True)  #normed= is deprecated/removed

#1. now do the KDE with Gaussian kernel with cross validation
grid = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': np.linspace(0.01, 1.5, 20)}, cv=5) # 5-fold cross-validation over 20 bandwidths
grid.fit(train_data)
print("grid.best_params_: " + str(grid.best_params_))

#get the best estimator
kde_gauss=grid.best_estimator_

##to play around and see the effect of changing the bandwidth manually
#kde_gauss = KernelDensity(kernel='gaussian', bandwidth=0.1)
#kde_gauss.fit(train_data)

#total log-likelihood of the validation data
print(kde_gauss.score(val_data))

#look at the fitted pdf
xs = np.reshape(np.linspace(-1, 5, 1000), (1000, 1))
plt.plot(xs, np.exp(kde_gauss.score_samples(xs)))


#2. now do the KDE with tophat kernel with cross validation
grid = GridSearchCV(KernelDensity(kernel='tophat'), {'bandwidth': np.linspace(0.01, 1.5, 20)}, cv=5) # 5-fold cross-validation over 20 bandwidths
grid.fit(train_data)
print("grid.best_params_: " + str(grid.best_params_))

#get the best estimator
kde_tophat=grid.best_estimator_

#look at the fitted pdf
xs = np.reshape(np.linspace(-1, 5, 1000), (1000, 1))
plt.plot(xs, np.exp(kde_tophat.score_samples(xs)))

#total log-likelihood of the validation data
print(kde_tophat.score(val_data))
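Since scikit-learn treats the kernel as just another hyperparameter of `KernelDensity`, both questions can also be answered with a single grid search instead of one search per kernel. This is a minimal sketch along those lines (the toy data and the candidate grids are assumptions, mirroring the demonstration above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Bimodal toy data standing in for the earnings samples (assumption)
data = np.concatenate([rng.normal(0.1, 0.2, 500),
                       rng.normal(2.9, 0.8, 500)]).reshape(-1, 1)
rng.shuffle(data)

# Hold out 10% as validation data
n_val = len(data) // 10
val_data, train_data = data[:n_val], data[n_val:]

# Search kernel and bandwidth jointly; GridSearchCV scores each
# candidate by the held-out log-likelihood across the 5 folds
params = {'kernel': ['gaussian', 'epanechnikov', 'tophat'],
          'bandwidth': np.linspace(0.05, 1.0, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(train_data)

print(grid.best_params_)
print(grid.best_estimator_.score(val_data))
```

The winning kernel/bandwidth pair is then available as `grid.best_estimator_`, ready to be scored on the validation data or plotted as above.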