Solved – np package kernel density estimation with Epanechnikov kernel

kernel-smoothing nonparametric r

I'm working with the "geyser" data set from the MASS package and comparing kernel density estimates produced by the np package.

I have trouble understanding the density estimate I get with least squares cross-validation and the Epanechnikov kernel:

library(MASS)  # provides the geyser data
library(np)
blep<-npudensbw(~geyser$waiting,bwmethod="cv.ls",ckertype="epanechnikov")
plot(npudens(bws=blep))

For the Gaussian kernel it seems to be fine:

blga<-npudensbw(~geyser$waiting,bwmethod="cv.ls",ckertype="gaussian")
plot(npudens(bws=blga))

Or if I use the Epanechnikov kernel and maximum likelihood cv:

bmax<-npudensbw(~geyser$waiting,bwmethod="cv.ml",ckertype="epanechnikov")
plot(npudens(~geyser$waiting,bws=bmax))

Is it my fault or is it a problem in the package?

Edit: If I use Mathematica with the Epanechnikov kernel and least squares cv, it works:

d = SmoothKernelDistribution[data, "LeastSquaresCrossValidation", "Epanechnikov"]
Plot[PDF[d, x], {x, 20, 110}]

Best Answer

EDIT

This is explained in the np package FAQ:

I use plot() (npplot()) to plot, say, a density and the resulting plot looks like an inverted density rather than a density

This can occur when the data-driven bandwidth is dramatically undersmoothed. Data-driven (i.e., automatic) bandwidth selection procedures are not guaranteed always to produce good results due to perhaps the presence of outliers or the rounding/discretization of continuous data, among others. By default, npplot() takes the two extremes of the data (minimum and maximum, i.e., actual data points) then creates an equally spaced grid of evaluation data (i.e., not actual data points in general) and computes the density for these points. Since the bandwidth is extremely small, the density estimate at these evaluation points is correctly zero, while those for the sample realizations (in this case only two, the min and max) are non-zero, hence we get two peaks at the edges of the plot and a flat bowl equal to zero everywhere else. This can also happen when your data is heavily discretized and you treat it as continuous. In such cases, treating the data as ordered may result in more sensible estimates.
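The mechanism the FAQ describes can be reproduced in base R without np. The following is only a sketch: `epan` and `kde` are hand-written helpers (not np functions), the sample is a small made-up set of integer values standing in for rounded data, and the tiny bandwidth is chosen by hand rather than by cross-validation.

```r
# Second-order Epanechnikov kernel on [-1, 1]; np's internal scaling may differ.
epan <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Plain kernel density estimate: f(x) = 1 / (n h) * sum_i K((x - x_i) / h)
kde <- function(x, data, h) {
  sapply(x, function(x0) sum(epan((x0 - data) / h)) / (length(data) * h))
}

# Hypothetical integer-valued sample (assumption, not the geyser data).
waiting <- c(43, 50, 50, 55, 62, 70, 70, 75, 81, 81, 88, 96)

# Evaluation grid as described in the FAQ: equally spaced between the sample
# minimum and maximum, so only the two end points are actual data values.
grid <- seq(min(waiting), max(waiting), length.out = 100)

f_tiny <- kde(grid, waiting, h = 1e-3)  # badly undersmoothed
f_ok   <- kde(grid, waiting, h = 5)     # hand-picked, wider bandwidth

which(f_tiny > 0)  # indices of grid points with a non-zero estimate
plot(grid, f_tiny, type = "l")  # two spikes at the edges, zero in between
```

With the tiny bandwidth, the compact support of the Epanechnikov kernel means a grid point only receives mass if a data value lies within `h` of it, which here happens only at the two end points. A Gaussian kernel has unbounded support, which may be part of why the Gaussian fit in the question looks less extreme.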

As suggested, treating the data as ordered works:

blep<-npudensbw(~ordered(geyser$waiting), 
                bwmethod="cv.ls", ckertype="epanechnikov", ckerorder=2)

It also succeeds with higher kernel orders, such as ckerorder=4 in this example.

Related Solutions

Local Extrema of Density Function Using Splines in R

What you want to do is called peak detection in chemometrics. There are various methods you can use for that. I demonstrate only a very simple approach here.

require(graphics)
#some data
d <- density(faithful$eruptions, bw = "sj")

#make it a time series
ts_y<-ts(d$y)

#calculate turning points (extrema)
require(pastecs)
tp<-turnpoints(ts_y)
#plot
plot(d)
points(d$x[tp$tppos],d$y[tp$tppos],col="red")
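If pastecs is not available, the same idea can be sketched in base R by looking for sign changes in the slope of the estimated density. This is an alternative to `turnpoints()`, not what that function does internally; the names `slope`, `peaks` and `pits` are chosen here for illustration, and exactly flat stretches of the curve would need extra handling.

```r
# Density estimate on a regular grid (same call as above).
d <- density(faithful$eruptions, bw = "sj")

# Sign of the slope between neighbouring grid points: +1 rising, -1 falling.
slope <- sign(diff(d$y))

# A maximum sits where the slope switches from + to -, a minimum from - to +.
# diff(slope) is two elements shorter than d$y, so add 1 to index into d$x / d$y.
peaks <- which(diff(slope) < 0) + 1
pits  <- which(diff(slope) > 0) + 1

plot(d)
points(d$x[peaks], d$y[peaks], col = "red",  pch = 19)  # local maxima
points(d$x[pits],  d$y[pits],  col = "blue", pch = 19)  # local minima
```

Keeping maxima and minima in separate index vectors makes it easy to report only the modes, for example with `d$x[peaks]`.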

Solved – Kernel density estimation with an Epanechnikov kernel in MATLAB

Here is the corrected code. You basically need to check the matrix multiplications to make sure that the outputs have the correct dimensions. For example,

 x=linspace(-1,1,1000); p=ones(n,1); xi=x*p;

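will yield a single number, xi = 0, and so on: x is 1-by-1000 and p is n-by-1 with n = 1000, so x*p is the sum of a grid that is symmetric around zero rather than a matrix. There were also some minor bugs, such as setting the proper bandwidth and taking the sum at each evaluation point, which I think I have fixed. The corrected code follows: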

% Kernel density estimation for a mixture of normally distributed random variables.
% Set parameters for the data generating process.
n=1000; % Number of observations.
data1 = rand(n,1); % n uniform draws used to pick the mixture component.
% Initializing the zero matrix of data from the mixture.
data2=zeros(n,1);
% Generate the data (mixture of normal random variables).
data2 = (data1 < 1/3).*normrnd(2,1,n,1) + ...
 (data1 >= 1/3).*normrnd(-2,1,n,1);
% Generating the density (using an Epanechnikov kernel).
% Setting the parameters.
s=std(data2); % standard deviation of the data (not used below)
h=1; % bandwidth parameter
% Evaluate the kernel at all x's in the domain.
x=linspace(-3,3,1000); m=numel(x); p=ones(n,1);
xi=p*x; % n-by-m matrix: the evaluation grid repeated in every row.
data2i=data2*ones(1,m); % n-by-m matrix: each data point repeated along its row.
u=(xi-data2i)/h; % matrix of u's.
absu=abs(u); % absolute value of u.
I=(absu<=1); % indicator of the kernel support.
f=(.75/h)*(1-u.^2).*I; % kernel contributions, already scaled by 1/h.
ff = sum(f,1)/n; % average over the n data points so the estimate integrates to 1.
plot(x,ff);


Needless to say, neither version of the code is optimized, which was not the intention anyway.
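For readers following along in R, the same computation can be sketched with `outer()`, which builds the data-by-grid matrix directly and avoids the multiplications by vectors of ones that caused the dimension problems. This is an illustrative translation under assumptions, not part of the original answer: the seed, the grid limits and the grid length are arbitrary choices, and the estimate is divided by n so that it integrates to roughly one.

```r
set.seed(42)
n <- 1000

# Mixture of normals: N(2, 1) with probability 1/3, otherwise N(-2, 1).
from_first <- runif(n) < 1/3
data2 <- ifelse(from_first, rnorm(n, mean = 2), rnorm(n, mean = -2))

h <- 1                              # bandwidth, fixed by hand
x <- seq(-8, 8, length.out = 500)   # grid length deliberately differs from n

# n-by-length(x) matrix of scaled differences (x_j - data_i) / h.
u <- outer(data2, x, function(xi, x0) (x0 - xi) / h)

# Epanechnikov kernel, zero outside [-1, 1].
K <- 0.75 * (1 - u^2) * (abs(u) <= 1)

# Sum over observations (rows), then divide by n * h.
f_hat <- colSums(K) / (n * h)

dim(u)                       # rows = observations, columns = grid points
sum(f_hat) * (x[2] - x[1])   # Riemann sum of the estimate, close to 1
plot(x, f_hat, type = "l")
```

Using a grid whose length differs from n makes dimension mistakes fail loudly instead of passing by coincidence.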
