Solved – If the Epanechnikov kernel is theoretically optimal when doing Kernel Density Estimation, why isn’t it more commonly used

kernel-smoothingnonparametric

I have read (for example, here) that the Epanechnikov kernel is optimal, at least in a theoretical sense, when doing kernel density estimation. If this is true, then why does the Gaussian show up so frequently as the default kernel, or in many cases the only kernel, in density estimation libraries?

Best Answer

The reason why the Epanechnikov kernel isn't universally used for its theoretical optimality may very well be that the Epanechnikov kernel isn't actually theoretically optimal. Tsybakov explicitly criticizes the argument that the Epanechnikov kernel is "theoretically optimal" in pp. 16-19 of Introduction to Nonparametric Estimation (section 1.2.4).

Trying to summarize, under some assumptions on the kernel $K$ and a fixed density $p$ one has that the mean integrated square error is, of the form

$$\frac{1}{nh} \int K^2 (u) du + \frac{h^4}{4}S_K^2 \int (p''(x))^2 dx \,. \tag{1} $$

The main criticism of Tsybakov seems to be minimizing over non-negative kernels, since it's often possible to get better performing estimators, which are even non-negative, without restricting to non-negative kernels.

The first step of the argument for the Epanechnikov kernel begins by minimizing $(1)$ over $h$ and all non-negative kernels (rather than all kernels of a wider class) to get an "optimal" bandwidth for $K$

$$ h^{MISE}(K) = \left( \frac{\int K^2}{nS_K^2 \int (p'')^2} \right)^{1/5}$$

and the "optimal" kernel (Epanechnikov)

$$K^*(u) = \frac{3}{4}(1-u^2)_+ $$

whose mean integrated square error is:

$$h^{MISE}(K^*) = \left( \frac{15}{n \int (p'')^2} \right)^{1/5} \,. $$

These however aren't feasible choices, since they depend on knowledge (via $p''$) of the unknown density $p$ -- therefore they are "oracle" quantities.

A proposition given by Tsybakov implies that the asymptotic MISE for the Epanechnikov oracle is:

$$\lim_{n \to \infty} n^{4/5} \mathbb{E}_p \int (p_n^E (x) - p(x))^2 dx = \frac{3^{4/5}}{5^{1/5}4} \left( \int (p''(x))^2 dx \right)^{1/5} \,. \tag{2} $$

Tsybakov says (2) is often claimed to be the best achievable MISE, but then shows that one can use kernels of order 2 (for which $S_K =0$) to construct kernel estimators, for every $\varepsilon >0$, such that

$$ \limsup_{n \to \infty} n^{4/5} \mathbb{E}_p \int (\hat{p}_n (x) - p(x))^2 dx \le \varepsilon \,. $$

Even though $\hat{p}_n$ isn't necessarily non-negative, one still has the same result for the positive part estimator, $p_n^+ := \max(0, \hat{p}_n)$ (which is guaranteed to be non-negative even if $K$ isn't):

$$ \limsup_{n \to \infty} n^{4/5} \mathbb{E}_p \int (p_n^+ (x) - p(x))^2 dx \le \varepsilon \,. $$

Therefore, for $\varepsilon$ small enough, there exist true estimators which have smaller asymptotic MISE than the Epanechnikov oracle, even using the same assumptions on the unknown density $p$.

In particular, one has as a result that the infimum of the asymptotic MISE for a fixed $p$ over all kernel estimators (or positive parts of kernel estimators) is $0$. So the Epanechnikov oracle is not even close to being optimal, even when compared to true estimators.

The reason why people advanced the argument for the Epanechnikov oracle in the first place is that one often argues that the kernel itself should be non-negative because the density itself is non-negative. But as Tsybakov points out, one doesn't have to assume that the kernel is non-negative in order to get non-negative density estimators, and by allowing other kernels one can non-negative density estimators which (1) aren't oracles and (2) perform arbitrarily better than the Epanechnikov oracle for a fixed $p$. Tsybakov uses this discrepancy to argue that it doesn't make sense to argue for optimality in terms of a fixed $p$, but only for optimality properties which are uniform over a class of densities. He also points out that the argument still works when using the MSE instead of MISE.

EDIT: See also Corollary 1.1. on p.25, where the Epanechnikov kernel is shown to be inadmissible based on another criterion. Tsybakov really seems not to like the Epanechnikov kernel.

Related Solutions

Solved – Kernel density estimation with an Epanechnikov kernel in MATLAB

Here is the rectified code - you basically need to check the matrix multiplications to make sure that the outputs have correct dimensions. for example,

 x=linspace(-1,1,1000); p=ones(n,1); xi=x*p;

will yield xi = 0 and so on. There were also some minor bugs like, setting the proper bandwith, taking the sum at each data points etc, which I think I have fixed. Please see below for the correct code and the resulting figure:

% Kernel density estimation using mixtures of normally distributed random % variables.
% Set parameters for data generating process. 
n=1000; % Number of observations. 
data1 = rand(n,1); % Returns data with 1000 observations.
% Initializing the zero matrix of data from the mixture. 
data2=zeros(n,1);
% % Generate the data. (mixture of normal random variables) 
data2 = (data1 < ones(n,1)/3).*normrnd(2,1,n,1) + ... 
 (data1 > ones(n,1)/3).*normrnd(-2,1,n,1);
% Generating the density (using an Epanechnikov kernel.) 
% Setting the parameters. 
s=std(data2); % standard deviation of the data 
h=1; % bandwidth parameter
% Evaluate the kernel at all x's in the domain. 
x=linspace(-3,3,1000); p=ones(n,1); xi=p*x; 
% matrix for the x's where we evaluate the kernel.
data2i=data2*ones(1,n); % matrix for each of the data points in the kernel. 
u=(xi-data2i)/h; % matrix of u's. 
absu=abs(u); % absolute value of u. 
I=(absu<=1); 
f=(.75/h)*(1-u.^2).*I;
ff = sum(f,1); 
plot(x,ff);

enter image description here

Needless to say, neither code is optimized - which of course was not the intention either.

Distributions – Practical Applications of Asymptotic Mean Integrated Squared Error in Kernel Density Estimation

The AMISE allows one to obtain an expression for the optimal bandwidth for the unknown density $f$. Unfortunately, the expression is in terms of derivatives of $f$. However, it is possible to derive a similar expression giving the optimal bandwidth for a kernel estimate of those derivatives. This is expressed in terms of even higher derivatives of $f$. And so on.

This seems like it might be an unending sequence of pointless theory. But the neat thing is, that for some sufficiently high order derivatives you can just assume that $f$ is normal. Then you can work your way back through the levels to find the bandwidth for $f$. It turns out that this works really well and almost nothing is lost provided $f$ is sufficiently smooth and enough levels of iteration are used (usually only 2 or 3 levels are needed).

The practical result is a bandwidth selection method which is general and quite robust. The most popular version of this is the Sheather-Jones plug-in method which is implemented in several software packages. In R, you can get a density estimate using the Sheather-Jones method:

density(x, bw="SJ")

That usually gives better results than the default bandwidth.

Best Answer

Related Solutions

Solved – Kernel density estimation with an Epanechnikov kernel in MATLAB

Distributions – Practical Applications of Asymptotic Mean Integrated Squared Error in Kernel Density Estimation

Related Question