Solved – Kernel density estimation bandwidth selection

kernel-smoothingMATLABnonparametric

I've just started using Kernel Density Estimation for my study, and encountered a problem.

In KDE, we have to select a proper bandwidth $h$ according to the data. If we don't, it could lead wrong estimation as shown in the figure (taken from Wikipedia), where the gray line is a true PDF.

exapmle of KDE

MATLAB's built-in function ksdensity cannot handle with this issue while function kde (distributed here) can automatically set optimal bandwidth as is said in description.

My intention is to get two large peaks (to be precisely, the one whose sample's value is smaller) from the density function because the data ideally/theoretically has two peaks. However, using kde results in getting more than 2 peaks. It is said that the optimal bandwidth is
\begin{equation}
h=\hat\sigma\sqrt[5]{\frac{4}{3n}},
\end{equation}
where $\hat\sigma$ is the standard deviation of the samples provided that the kernel function is Gaussian basis and that samples follow normal distribution. Since the density function should have two peak as mentioned above, the samples don't seem to follow normal distribution and thus don't meet the assumption.

So the question is: is it OK to choose a bigger bandwidth to smooth the density function and to decrease the number of peaks, or should I use the bandwidth which kde selects by default and acquire the desired two peaks with some post-process?

It may be said that I could use either way as long as it returns the intended result, but I'm glad if there is some statistical evidence which supports the method.

I am very sorry that I cannot show the examples I'm actually working on due to sensitive concerns.

I appreciate any kinds of help. Thanks in advance.

Best Answer

There is no reason to choose the optimal bandwith because, as you said, your data does not meet the assumption under which it is derived.

This is a parameter selection problem and it is very easy to overfit using kde and thus end up with a lot of peaks. If you are having multiple peaks then chances are that you are not modeling the true distribution but only your dataset. If you really want to use kde then choose a larger bandwith so that you get 2 peaks, no need for post processing.

But apparently you have some prior knowledge about your distribution and that you should use. If your distribution has 2 peaks that are both due to 2 gaussians then I suggest you use MLE for mixture of gaussians, it will probably give you better results.

PS: I would have comment before posting an answer but I can't ...

Related Question