One place to start would be Silverman's nearest-neighbor estimator, with your weights incorporated somehow (I am not sure exactly what your weights represent here). The nearest-neighbor method can evidently be formulated in terms of distances. I believe your first- and second-nearest-neighbor methods are versions of the nearest-neighbor method, but without a kernel function and with a small value of $k$.
Corresponding to any batch of data $X = (x_1, x_2, \ldots, x_n)$ is its "empirical density function"
$$f_X(x) = \frac{1}{n}\sum_{i=1}^{n} \delta(x-x_i).$$
Here, $\delta$ is a "generalized function." Despite that name, it isn't a function at all: it's a new mathematical object that can be used only within integrals. Its defining property is that for any function $g$ of compact support that is continuous in a neighborhood of $0$,
$$\int_{\mathbb{R}}\delta(x) g(x) dx = g(0).$$
(Names for $\delta$ include "atomic" or "point" measure and "Dirac delta function." In the following calculation this concept is extended to include functions $g$ which are continuous from one side only.)
Justifying this characterization of $f_X$ is the observation that
$$\eqalign{
\int_{-\infty}^{x} f_X(y) dy
&= \int_{-\infty}^{x} \frac{1}{n}\sum_{i=1}^{n} \delta(y-x_i)dy \\
&= \frac{1}{n}\sum_{i=1}^{n} \int_{-\infty}^{x} \delta(y-x_i)dy \\
&= \frac{1}{n}\sum_{i=1}^{n} \int_{\mathbb{R}} I(y\le x) \delta(y-x_i)dy \\
&= \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x) \\
&= F_X(x)
}$$
where $F_X$ is the usual empirical CDF and $I$ is the usual characteristic function (equal to $1$ where its argument is true and $0$ otherwise). (I skip an elementary limiting argument needed to move from functions of compact support to functions defined over $\mathbb{R}$; because $I$ only needs to be defined for values within the range of $X$, which is compact, this is no problem.)
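The identity derived above, $\int_{-\infty}^x f_X(y)\,dy = \frac{1}{n}\sum_i I(x_i \le x) = F_X(x)$, is easy to verify numerically. Here is a minimal sketch with a small hypothetical data batch (the values are illustrative, not from the question):

```python
import numpy as np

# A small hypothetical batch of data X = (x_1, ..., x_n).
X = np.array([0.2, 1.5, -0.7, 2.1, 0.2])

def ecdf(x, data):
    """Empirical CDF: F_X(x) = (1/n) * sum_i I(x_i <= x)."""
    return np.mean(data <= x)

# Integrating the empirical density picks up a mass of 1/n from each
# delta spike at x_i <= x, which is exactly the indicator sum above.
assert abs(ecdf(0.2, X) - 3 / 5) < 1e-12   # three of five points are <= 0.2
assert ecdf(-2.0, X) == 0.0                # below all data
assert ecdf(3.0, X) == 1.0                 # above all data
```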
The convolution of $f_X(x)$ with any other function $k$ is given, by definition, as
$$\eqalign{
(f_X * k)(x) &= \int_{\mathbb{R}} f_X(x - y) k(y) dy \\
&=\int_{\mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} \delta(x-y-x_i) k(y) dy \\
&= \frac{1}{n}\sum_{i=1}^{n}\int_{\mathbb{R}} \delta(x-y-x_i) k(y) dy \\
&=\frac{1}{n}\sum_{i=1}^{n} k(x-x_i),
}$$
because $\delta(x-y-x_i)$ concentrates all its mass at $y = x - x_i$. Letting $k(x) = K_h(-x)$ (which is the same as $K_h(x)$ for symmetric kernels, and most kernels are symmetric), we obtain the claimed result: the Wikipedia formula is a convolution.
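This equivalence can be checked numerically: evaluate the KDE directly as $\frac{1}{n}\sum_i K_h(x - x_i)$, and separately discretize the empirical density as spikes of mass $1/n$ on a fine grid and convolve with the kernel. A minimal sketch, with an illustrative data batch and a Gaussian $K_h$ (any symmetric kernel would do):

```python
import numpy as np

# Hypothetical data batch and bandwidth (illustrative choices).
X = np.array([-1.2, -0.4, 0.1, 0.5, 1.7, 2.2])
h = 0.4

def K_h(u):
    """Gaussian kernel: normal density with mean 0 and sd h."""
    return np.exp(-u**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))

def kde(x):
    """Direct formula: (1/n) * sum_i K_h(x - x_i)."""
    return np.mean(K_h(x - X))

# Convolution view: discretize f_X as spikes of mass 1/n on a fine
# grid, then convolve with the kernel sampled on the same grid.
grid = np.linspace(-6, 6, 12001)          # step dx = 0.001
dx = grid[1] - grid[0]
spikes, _ = np.histogram(X, bins=len(grid),
                         range=(grid[0] - dx / 2, grid[-1] + dx / 2))
f_emp = spikes / (len(X) * dx)            # discretized empirical density
f_conv = np.convolve(f_emp, K_h(grid) * dx, mode="same")

# The convolution matches the direct formula up to grid error.
for i in (4000, 6000, 8000):
    assert abs(kde(grid[i]) - f_conv[i]) < 2e-3
```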
Best Answer
I guess the problem is that you are using the wrong formula for the Gaussian kernel. The Gaussian kernel uses the normal probability density function, which has the form
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\tfrac{(x-\mu)^2}{2\sigma^2}}, $$
where the distribution with parameters $\mu=0$ and $\sigma^2 = 1$ is called the standard normal distribution. The formula you quote resembles it. The Gaussian kernel is the normal density centered at mean $\mu=0$ with variance $\sigma^2 = h^2$. So $h$ is the scale parameter (standard deviation) of the kernel, and it serves the same purpose as the bandwidth in other kernels: it controls the "width" of the kernel.
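As a sanity check of the role $h$ plays, here is a minimal sketch of the Gaussian kernel written explicitly as a normal density with standard deviation $h$ (the function name is illustrative):

```python
import math

def gaussian_kernel(x, h):
    """K_h(x): normal density with mean 0 and standard deviation h."""
    return math.exp(-x**2 / (2 * h**2)) / (h * math.sqrt(2 * math.pi))

# With h = 1 this is the standard normal density: 1/sqrt(2*pi) at x = 0.
assert abs(gaussian_kernel(0.0, 1.0) - 1 / math.sqrt(2 * math.pi)) < 1e-12

# Doubling h doubles the spread and halves the peak height -- the same
# "width" effect a bandwidth has for any kernel.
assert abs(gaussian_kernel(0.0, 2.0) - 0.5 * gaussian_kernel(0.0, 1.0)) < 1e-12
```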
As Dan said, kernels are often defined in terms of a "standard" kernel; e.g., the standard triangular kernel is $K(x) = (1 - |x|)$ for $|x| \le 1$ (and $0$ otherwise), and the scaled kernel is $K_h(x) = K(x\,/\,h)\,/\,h$. The standard form of the Gaussian kernel is the standard normal density.
See also the Why does definition of kernel include bandwidth? thread.