The optimal bandwidth for derivative estimation will be different from the bandwidth for density estimation. In general, every feature of a density has its own optimal bandwidth selector.
If your objective is to minimize mean integrated squared error (which is the usual criterion), there is nothing subjective about it: it is a matter of deriving the value that minimizes the criterion. The equations are given in Section 2.10 of Hansen (2009).
The tricky part is that the optimal bandwidth is a function of the density itself, so this solution is not directly useful. There are a number of methods around to try to deal with that problem. These usually approximate some functionals of the density using normal approximations. (Note, there is no assumption that the density itself is normal. The assumption is that some functionals of the density can be obtained assuming normality.)
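To make this concrete, here are the standard results for the plain density estimate (derivative order zero) with a second-order kernel; the derivative versions in Section 2.10 of Hansen (2009) have the same structure with different constants and exponents. Writing $R(g) = \int g(u)^2\,du$ and $\mu_2(K) = \int u^2 K(u)\,du$, the asymptotic MISE is
$$ \text{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\,\mu_2(K)^2\, h^4\, R(f''), $$
which is minimized at
$$ h_{\text{AMISE}} = \left[\frac{R(K)}{\mu_2(K)^2\, R(f'')\, n}\right]^{1/5}. $$
The unknown functional $R(f'')$ is where the density itself enters; replacing it by its value under a $N(\mu, \sigma^2)$ density, $R(f'') = \frac{3}{8\sqrt{\pi}\,\sigma^5}$, gives $h \approx 1.06\,\hat{\sigma}\,n^{-1/5}$ for the Gaussian kernel, which is exactly the kind of normality-based shortcut described above.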
Where the approximations are imposed determines how good the bandwidth selector is. The crudest approach is called the "normal reference rule", which imposes the approximation at a high level. The end of Section 2.10 in Hansen (2009) gives the formula using this approach. This approach is implemented in the hns() function from the ks package on CRAN. That's probably the best you will get if you don't want to write your own code. So you can estimate the derivative of a density as follows (using ks):
library(ks)                            # provides hns() and kdde()
h <- hns(x, deriv.order = 1)           # normal-reference bandwidth for the first derivative
den <- kdde(x, h = h, deriv.order = 1) # estimate of the density derivative; plot(den) displays it
A better approach, usually known as a "direct plug-in" selector, imposes the approximation at a lower level. For straight density estimation, this is the Sheather-Jones method, implemented in R using density(x, bw="SJ"). However, I don't think there is a similar facility available in any R package for derivative estimation.
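For what it's worth, the Sheather-Jones selector is also exposed directly as bw.SJ() in base R's stats package. A minimal illustration (the data here are simulated so the snippet runs standalone):
set.seed(42)
x <- rnorm(500)           # simulated data, purely for illustration
bw <- bw.SJ(x)            # Sheather-Jones plug-in bandwidth
d <- density(x, bw = bw)  # equivalent to density(x, bw = "SJ")
plot(d)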
Rather than use straight kernel estimation, you may be better off with a local polynomial estimator. This can be done using the locpoly() function from the KernSmooth package in R. Again, there is no optimal bandwidth selection implemented, but the bias will be smaller than for kernel estimators. For example (using dpik()'s density-level bandwidth as a rough stand-in, since nothing better is available):
library(KernSmooth)                              # for locpoly() and dpik()
den2 <- locpoly(x, bandwidth = dpik(x), drv = 1) # dpik() is a density-level plug-in bandwidth, only a rough guess here
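Since dpik() targets the density rather than its derivative, treat that bandwidth purely as a starting point: inspect the resulting estimate and adjust up or down by eye.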
There is no reason to choose the optimal bandwidth because, as you said, your data does not meet the assumptions under which it is derived.
This is a parameter selection problem, and it is very easy to overfit with KDE and end up with many spurious peaks. If you are seeing multiple peaks, chances are you are modeling only your dataset rather than the true distribution. If you really want to use KDE, then choose a larger bandwidth so that you get two peaks; there is no need for post-processing.
But apparently you have some prior knowledge about your distribution, and you should use it. If your distribution has two peaks that both come from Gaussians, then I suggest you fit a mixture of Gaussians by maximum likelihood; it will probably give you better results.
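For instance, a sketch using the mclust package (the package choice is my suggestion, not something required; mixtools::normalmixEM is an alternative):
library(mclust)
set.seed(1)
x <- c(rnorm(200, -2, 0.5), rnorm(300, 1, 1))  # simulated bimodal data for illustration
fit <- densityMclust(x, G = 2)                 # two-component Gaussian mixture, fit by EM (maximum likelihood)
summary(fit, parameters = TRUE)                # estimated means, variances, and mixing weights
plot(fit, what = "density", data = x, breaks = 30)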
Best Answer
For simplicity, let's assume that we are talking about some really simple kernel, say the triangular kernel:
$$ K(x) = \begin{cases} 1 - |x| & \text{if } x \in [-1, 1] \\ 0 & \text{otherwise} \end{cases} $$
Recall that in kernel density estimation the estimate $\hat f_h$ is built by combining $n$ kernels, parametrized by $h$ and centered at the points $x_i$:
$$ \hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big) $$
Notice that by $\frac{x-x_i}{h}$ we mean that we re-scale the difference between some $x$ and the point $x_i$ by the factor $h$. Most kernels (the Gaussian being a notable exception) have support limited to the $(-1, 1)$ range, which means they return densities equal to zero for points outside the $(x_i-h, x_i+h)$ range. Put differently, $h$ is a scale parameter for the kernel that changes its support from $(-1, 1)$ to $(-h, h)$.
This is illustrated in the plot below, where $n=7$ points are used to estimate kernel densities with different bandwidths $h$ (colored points on top mark the individual values, colored lines are the kernels, and the gray line is the overall kernel estimate). As you can see, $h < 1$ makes the kernels narrower, while $h > 1$ makes them wider. Changing $h$ influences both the individual kernels and the final kernel density estimate, since it is a mixture distribution of the individual kernels. Higher $h$ makes the kernel density estimate smoother, while smaller $h$ brings the kernels closer to the individual data points; as $h \rightarrow 0$ you would end up with just a bunch of Dirac delta functions centered at the $x_i$ points.
A sketch of R code that produces this kind of plot (the specific data points here are illustrative, not the original ones):
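# Illustrative reconstruction: triangular-kernel density estimates
# for several bandwidths h, with the individual kernels overlaid.
K <- function(u) pmax(1 - abs(u), 0)            # triangular kernel on [-1, 1]
x <- c(-2.1, -1.3, -0.4, 0.2, 0.9, 1.7, 2.5)    # n = 7 illustrative points
n <- length(x)
fhat <- function(t, h) sapply(t, function(t0) sum(K((t0 - x) / h)) / (n * h))

t <- seq(-4, 4, length.out = 401)
par(mfrow = c(1, 3))
for (h in c(0.3, 1, 2)) {
  ymax <- max(fhat(t, h), 1 / (n * h))          # leave room for the tallest kernel
  plot(t, fhat(t, h), type = "l", col = "gray", lwd = 2, ylim = c(0, ymax),
       main = paste("h =", h), xlab = "x", ylab = "density")
  for (i in 1:n) lines(t, K((t - x[i]) / h) / (n * h), col = i)  # individual kernels
  points(x, rep(0, n), col = 1:n, pch = 19)                      # the data points
}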
For more details you can check the great introductory books by Silverman (1986) and Wand & Jones (1995).
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman & Hall/CRC.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall/CRC.