Solved – Kernel Density estimation function and bandwidth selection

kernel-smoothing, nonparametric

My question is about

1) how to identify the best kernel function to use (for instance Epanechnikov, Gaussian, triangular, etc.) for earnings in the formal and informal sectors using Stata, and

2) how to work out the bandwidth that would give the best estimate.

Best Answer

Check out the webpage: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

1) You can test them separately: keep some of your data aside as validation data, fit the KDE without the validation data, then look at the likelihood of the validation data under the fitted KDE model. The kernel that gives the highest likelihood is probably the best kernel.

2) You can use cross-validation to find the best bandwidth. There is a section "Bandwidth Cross-Validation in Scikit-Learn" in the link above, which shows how to do it in a couple of lines.
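The held-out-likelihood comparison in point 1 can be sketched directly with scikit-learn's `KernelDensity` (a minimal sketch: the bimodal toy data and the fixed bandwidth of 0.3 are assumptions, not part of the original question):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Bimodal toy data standing in for the earnings samples (assumption)
data = np.concatenate([rng.normal(0.1, 0.2, 500),
                       rng.normal(2.9, 0.8, 500)]).reshape(-1, 1)
rng.shuffle(data)

# Hold out 10% as validation data
n_val = len(data) // 10
val_data, train_data = data[:n_val], data[n_val:]

# Fit one KDE per candidate kernel and score the held-out data;
# score() returns the total log-likelihood of the validation set
scores = {}
for kernel in ['gaussian', 'epanechnikov', 'tophat', 'linear']:
    kde = KernelDensity(kernel=kernel, bandwidth=0.3).fit(train_data)
    scores[kernel] = kde.score(val_data)

best_kernel = max(scores, key=scores.get)
print(best_kernel, scores[best_kernel])
```

Note that scikit-learn calls the triangular kernel `'linear'`, and that kernels with bounded support (tophat, Epanechnikov, linear) can return a log-likelihood of minus infinity if a validation point falls outside the support of every training point.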

EDIT: Here is a demonstration of how you could do it (code mostly taken from the link). The code is in Python, which is well suited to this kind of application:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

nsamp = 500

#simulate bimodal data from two normal distributions
data = np.concatenate([np.random.normal(loc=0.1, scale=0.2, size=nsamp),
                       np.random.normal(loc=2.9, scale=0.8, size=nsamp)])
data = np.reshape(data, (2 * nsamp, 1))
np.random.shuffle(data)  #shuffle so both components appear in both splits

#separate into validation (10%) and training (90%)
n_val = int(2 * nsamp * 0.1)
val_data = data[:n_val]
train_data = data[n_val:]

#look at the training data with a histogram
plt.hist(train_data, bins=100, density=True)  #normed= is deprecated/removed

#1. now do the KDE with Gaussian kernel with cross validation
grid = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': np.linspace(0.01, 1.5, 20)}, cv=5) # 5-fold cross-validation over 20 bandwidths
grid.fit(train_data)
print("grid.best_params_: " + str(grid.best_params_))

#get the best estimator
kde_gauss=grid.best_estimator_

##to play around and see the effect of changing the bandwidth manually
#kde_gauss = KernelDensity(kernel='gaussian', bandwidth=0.1)
#kde_gauss.fit(train_data)

#total log-likelihood of the validation data
print(kde_gauss.score(val_data))

#look at the fitted pdf
xs = np.reshape(np.linspace(-1, 5, 1000), (1000, 1))
plt.plot(xs, np.exp(kde_gauss.score_samples(xs)))


#2. now do the KDE with tophat kernel with cross validation
grid = GridSearchCV(KernelDensity(kernel='tophat'), {'bandwidth': np.linspace(0.01, 1.5, 20)}, cv=5) # 5-fold cross-validation over 20 bandwidths
grid.fit(train_data)
print("grid.best_params_: " + str(grid.best_params_))

#get the best estimator
kde_tophat=grid.best_estimator_

#look at the fitted pdf
xs = np.reshape(np.linspace(-1, 5, 1000), (1000, 1))
plt.plot(xs, np.exp(kde_tophat.score_samples(xs)))

#total log-likelihood of the validation data
print(kde_tophat.score(val_data))
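Since scikit-learn treats the kernel as just another hyperparameter of `KernelDensity`, both questions can also be answered with a single grid search instead of one search per kernel. This is a minimal sketch along those lines (the toy data and the candidate grids are assumptions, mirroring the demonstration above):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Bimodal toy data standing in for the earnings samples (assumption)
data = np.concatenate([rng.normal(0.1, 0.2, 500),
                       rng.normal(2.9, 0.8, 500)]).reshape(-1, 1)
rng.shuffle(data)

# Hold out 10% as validation data
n_val = len(data) // 10
val_data, train_data = data[:n_val], data[n_val:]

# Search kernel and bandwidth jointly; GridSearchCV scores each
# candidate by the held-out log-likelihood across the 5 folds
params = {'kernel': ['gaussian', 'epanechnikov', 'tophat'],
          'bandwidth': np.linspace(0.05, 1.0, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(train_data)

print(grid.best_params_)
print(grid.best_estimator_.score(val_data))
```

The winning kernel/bandwidth pair is then available as `grid.best_estimator_`, ready to be scored on the validation data or plotted as above.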