Solved – Ratio of probabilities vs ratio of PDFs

Tags: bayesian, kernel-smoothing, maximum-likelihood, probability

I'm using Bayes to solve a clustering problem. After doing some calculations I end up with the need to obtain the ratio of two probabilities:

$$P(A)/P(B)$$

to be able to obtain $P(H|D)$. These probabilities are obtained by integration of two different 2D multivariate KDEs as explained in this answer:

$$P(A) = \iint_{x, y : \hat{f}(x, y) < \hat{f}(r_a, s_a)} \hat{f}(x,y)\,dx\,dy$$
$$P(B) = \iint_{x, y : \hat{g}(x, y) < \hat{g}(r_b, s_b)} \hat{g}(x,y)\,dx\,dy$$

where $\hat{f}(x, y)$ and $\hat{g}(x, y)$ are the KDEs and the integration is done for all points below the thresholds $\hat{f}(r_a, s_a)$ and $\hat{g}(r_b, s_b)$. Both KDEs use a Gaussian kernel. A representative image of a KDE similar to the ones I'm working with can be seen here: Integrating kernel density estimator in 2D.

I calculate the KDEs with the Python function scipy.stats.gaussian_kde, so I assume the following general form for it:

$$KDE(x,y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2\pi h^2} e^{-\frac{(x-x_i)^2 + (y-y_i)^2}{2h^2}}$$

where n is the length of my array of points and h is the bandwidth used.
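As a minimal sketch of that general form, assuming an isotropic Gaussian kernel with a single bandwidth $h$ and the usual $1/(2\pi h^2)$ normalization, the KDE can be evaluated directly. The function name and the toy points are hypothetical; note that scipy.stats.gaussian_kde scales its kernel by the data covariance, so its values will generally differ from this simplified version.

```python
import math

def gaussian_kde_2d(points, h):
    """Return a function evaluating an isotropic 2D Gaussian KDE with bandwidth h."""
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * h * h)  # makes the estimate integrate to 1

    def kde(x, y):
        total = 0.0
        for xi, yi in points:
            d2 = (x - xi) ** 2 + (y - yi) ** 2
            total += math.exp(-d2 / (2.0 * h * h))
        return norm * total

    return kde

# Hypothetical toy data, for illustration only.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
kde = gaussian_kde_2d(points, h=0.7)
print(kde(1.0, 1.0))
```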

The integrals above are calculated applying a Monte Carlo process which is quite computationally expensive. I've read somewhere (forgot where, sorry) that in cases like this it is possible to replace the ratio of probabilities by the ratio of PDFs (KDEs) evaluated at the threshold points to obtain equally valid results. I'm interested in this because computing the KDEs ratio is orders of magnitude faster than calculating the ratio of the integrals with MC.
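The question does not say which Monte Carlo scheme is used, so the following is only one possible sketch: draw samples from the KDE itself (pick a data point at random, add Gaussian noise of scale $h$) and count the fraction whose density falls below the threshold. It assumes the isotropic kernel form above; the helper names and toy data are hypothetical.

```python
import math
import random

def kde_value(points, h, x, y):
    """Isotropic 2D Gaussian KDE evaluated at (x, y)."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)
    return norm * sum(
        math.exp(-((x - xi) ** 2 + (y - yi) ** 2) / (2.0 * h * h))
        for xi, yi in points
    )

def mc_tail_probability(points, h, r, s, n_draws=20000, seed=0):
    """Estimate the KDE mass where the density is below its value at (r, s)."""
    rng = random.Random(seed)
    threshold = kde_value(points, h, r, s)
    hits = 0
    for _ in range(n_draws):
        xi, yi = rng.choice(points)      # choose a mixture component
        x = rng.gauss(xi, h)             # sample from that component
        y = rng.gauss(yi, h)
        if kde_value(points, h, x, y) < threshold:
            hits += 1
    return hits / n_draws

# Hypothetical toy data, for illustration only.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
print(mc_tail_probability(points, 0.7, 1.0, 1.0))
```

Each draw costs one full KDE evaluation, which is why this approach is so much slower than evaluating the KDE once at the threshold point.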

So the question is reduced to the validity of this expression:

$$\frac{P(A)}{P(B)} = \frac{\hat{f}(r_a, s_a)}{\hat{g}(r_b, s_b)}$$

Under which circumstances, if any, can I say that this relation is true?


Add:

Here's basically the same question but made in a more mathematical form.

Best Answer

The KDE is a mixture of Normal distributions. Let's look at a single one of them.

The definitions of $P(A)$ and $P(B)$ show their values are invariant under translations and rescalings in the plane, so it suffices to consider the standard Normal distribution with PDF $f$. The inequality

$$f(x,y) \le f(r,s)$$

is equivalent to

$$x^2 + y^2 \ge r^2 + s^2.$$

Introducing polar coordinates $\rho, \theta$ allows the integral to be rewritten

$$P(r,s) = \frac{1}{2\pi}\int_0^{2\pi}\int_{\sqrt{r^2+s^2}}^\infty \rho \exp(-\rho^2/2)\, d\rho\, d\theta = \exp(-(r^2+s^2)/2) = 2\pi f(r,s).$$
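The closed form can be checked numerically. This sketch approximates the radial integral with a midpoint rule (the upper cutoff and step count are arbitrary choices) and compares it with $2\pi f(r,s)$ for the standard bivariate Normal density; the function names are illustrative.

```python
import math

def tail_mass(r, s, upper=12.0, steps=20000):
    """Midpoint-rule estimate of the standard bivariate Normal mass
    outside the circle of radius sqrt(r^2 + s^2)."""
    lower = math.hypot(r, s)
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        rho = lower + (k + 0.5) * width
        total += rho * math.exp(-rho * rho / 2.0)
    # The angular integral contributes 2*pi, which cancels the 1/(2*pi) factor.
    return total * width

def standard_normal_pdf(r, s):
    """Standard bivariate Normal density at (r, s)."""
    return math.exp(-(r * r + s * s) / 2.0) / (2.0 * math.pi)

r, s = 1.0, 0.5
print(tail_mass(r, s), 2.0 * math.pi * standard_normal_pdf(r, s))
```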

Now consider the mixture. Because it is linear,

$$\eqalign{ P(r,s) &= \frac{1}{n}\sum_i 2\pi f((r-x_i)/h, (s-y_i)/h) \\ &= 2\pi h^2\left(\frac{1}{n}\sum_i \frac{1}{h^2} f((r-x_i)/h, (s-y_i)/h)\right) \\ &=2\pi h^2 KDE(r,s). }$$

Indeed, $f$ and $P$ are proportional. The constant of proportionality is $2\pi h^2$.

That such a proportionality relationship between $P$ and $f$ is special can be appreciated by contemplating a simple counterexample. Let $f_1$ have a uniform distribution on a measurable set $A_1$ of unit area and $f_2$ have a uniform distribution on a measurable set $A_2$ which is disjoint from $A_1$ and has area $\mu\gt 1$. Then the mixture with PDF $f=f_1/2 + f_2/2$ has constant value $1/2$ on $A_1$, $1/(2\mu)$ on $A_2$, and is zero elsewhere. There are three cases to consider:

$(r,s)\in A_1$. Here $f(r,s)=1/2$ attains its maximum, whence $P(r,s)=1$. The ratio $f(r,s)/P(r,s) = 1/2$.
$(r,s)\in A_2$. Here $f(r,s)$ is strictly less than $1/2$ but greater than $0$. Thus the region of integration is the complement of $A_1$ and the resulting integral must equal $1/2$. The ratio $f(r,s)/P(r,s) = (1/(2\mu))/(1/2) = 1/\mu$.
Elsewhere, $f$ is zero and the integral $P$ is zero.

Evidently the ratio (where it is defined) is not constant: it equals $1/2$ on $A_1$ and $1/\mu$ on $A_2$, and these differ whenever $\mu \ne 2$. Although this distribution is not continuous, it can be made so by adding an independent Normal$(0,\Sigma)$ variable to it (that is, convolving it with that Normal density). By making both eigenvalues of $\Sigma$ small, this will change the distribution very little and produce qualitatively the same results, only now the values of the ratio $f/P$ will include all the numbers between $1/2$ and $1/\mu$.
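The case analysis for the uniform mixture can be written out with exact arithmetic. This is only a restatement of the two non-trivial cases, using the non-strict inequality $f(x,y) \le f(r,s)$ as in the derivation; the function name and region labels are hypothetical.

```python
from fractions import Fraction

def ratio_f_over_p(region, mu):
    """f/P for the mixture f = f1/2 + f2/2, where f1 is uniform on A1 (area 1)
    and f2 is uniform on A2 (area mu > 1, disjoint from A1)."""
    if region == "A1":
        f = Fraction(1, 2)
        p = Fraction(1)          # f <= 1/2 everywhere, so all the mass is counted
    elif region == "A2":
        f = Fraction(1, 2) / mu
        p = Fraction(1, 2)       # A1 is excluded, leaving the mass 1/2 on A2
    else:
        raise ValueError("ratio undefined where f and P are both zero")
    return f / p

for mu in (2, 3, 10):
    print(mu, ratio_f_over_p("A1", mu), ratio_f_over_p("A2", mu))
```

The two ratios agree only in the special case $\mu = 2$, so no single constant of proportionality exists in general.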

This result also does not generalize to other dimensions. Essentially the same calculation that started this answer shows that $P$ is an incomplete Gamma function, which clearly is not the same as $f$. That two dimensions are special can be appreciated by noting that the integration in $P$ essentially concerns distances: when the coordinates are Normally distributed, the squared distance has a $\chi^2(2)$ distribution, which is an exponential distribution. The exponential function is unique in being proportional to its own derivative, whence the integrand $f$ and integral $P$ must be proportional.
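To make the dimension dependence concrete, the sketch below compares the tail mass $P$ with the density $f$ of a standard Normal in 1, 2 and 3 dimensions, using the closed-form $\chi^2$ tail probabilities for those cases. The function name is illustrative. The ratio $P/f$ is the constant $2\pi$ only when the dimension is 2.

```python
import math

def tail_and_density(dim, radius):
    """Return (P, f) for a standard Normal in `dim` dimensions, where P is the
    mass at distance >= radius from the origin and f is the density there."""
    q = math.exp(-radius * radius / 2.0)
    density = q / (2.0 * math.pi) ** (dim / 2.0)
    if dim == 1:
        tail = math.erfc(radius / math.sqrt(2.0))
    elif dim == 2:
        tail = q
    elif dim == 3:
        tail = (math.erfc(radius / math.sqrt(2.0))
                + math.sqrt(2.0 / math.pi) * radius * q)
    else:
        raise ValueError("only dimensions 1, 2 and 3 are covered in this sketch")
    return tail, density

for dim in (1, 2, 3):
    ratios = []
    for radius in (0.5, 1.0, 2.0):
        tail, density = tail_and_density(dim, radius)
        ratios.append(tail / density)
    print(dim, ratios)
```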

R Code

The algorithm is contained in the half dozen lines of the first function, f. To illustrate its use, the rest of the code generates the preceding figures.

library(MASS)     # kde2d
library(spatstat) # im class
f <- function(xy, n, x, y, ...) {
  #
  # Estimate the total where the density does not exceed that at (x,y).
  #
  # `xy` is a 2 by ... array of points.
  # `n`  specifies the numbers of rows and columns to use.
  # `x` and `y` are coordinates of "probe" points.
  # `...` is passed on to `kde2d`.
  #
  # Returns a list:
  #   image:    a raster of the kernel density
  #   integral: the estimates at the probe points.
  #   density:  the estimated densities at the probe points.
  #
  xy.kde <- kde2d(xy[1,], xy[2,], n=n, ...)
  xy.im <- im(t(xy.kde$z), xcol=xy.kde$x, yrow=xy.kde$y) # Allows interpolation
  z <- interp.im(xy.im, x, y)                            # Densities at the probe points
  c.0 <- sum(xy.kde$z)                                   # Normalization factor
  i <- sapply(z, function(a) sum(xy.kde$z[xy.kde$z < a])) / c.0
  return(list(image=xy.im, integral=i, density=z))
}
#
# Generate data.
#
n <- 256
set.seed(17)
xy <- matrix(c(rnorm(k <- ceiling(2*n * 0.8), mean=c(6,3), sd=c(3/2, 1)), 
               rnorm(2*n-k, mean=c(2,6), sd=1/2)), nrow=2)
#
# Example of using `f`.
#
y.probe <- 1:6
x.probe <- rep(6, length(y.probe))
lims <- c(min(xy[1,])-15, max(xy[1,])+15, min(xy[2,])-15, max(xy[2,])+15)
ex <- f(xy, 200, x.probe, y.probe, lims=lims)
ex$density; ex$integral
#
# Compare the effects of raster resolution and bandwidth.
#
res <- c(8, 40, 200, 1000)
system.time(
  est.0 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, lims=lims)$integral))
est.0
system.time(
  est.1 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=1, lims=lims)$integral))
est.1
system.time(
  est.2 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=1/2, lims=lims)$integral))
est.2
system.time(
  est.3 <- sapply(res, 
           function(i) f(xy, i, x.probe, y.probe, h=5, lims=lims)$integral))
est.3
results <- data.frame(Default=est.0[,4], Hp5=est.2[,4], 
                      H1=est.1[,4], H5=est.3[,4])
#
# Compare the integrals at the highest resolution.
#
par(mfrow=c(1,1))
panel <- function(x, y, ...) {
  points(x, y)
  abline(c(0,1), col="Red")
}
pairs(results, lower.panel=panel)
#
# Display two of the density estimates, the data, and the probe points.
#
par(mfrow=c(1,2))
xy.im <- f(xy, 200, x.probe, y.probe, h=0.5)$image
plot(xy.im, main="Bandwidth=1/2", col=terrain.colors(256))
points(t(xy), pch=".", col="Black")
points(x.probe, y.probe, pch=19, col="Red", cex=.5)

xy.im <- f(xy, 200, x.probe, y.probe, h=5)$image
plot(xy.im, main="Bandwidth=5", col=terrain.colors(256))
points(t(xy), pch=".", col="Black")
points(x.probe, y.probe, pch=19, col="Red", cex=.5)
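For readers working in Python, the grid idea behind the R function f can be sketched roughly as follows. This is an analogue under simplifying assumptions rather than a translation: it uses one isotropic bandwidth h, evaluates the density exactly at the probe point instead of interpolating the raster, and the function names, padding and grid size are all hypothetical choices.

```python
import math

def kde_value(points, h, x, y):
    """Isotropic 2D Gaussian KDE evaluated at (x, y)."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * h * h)
    return norm * sum(
        math.exp(-((x - xi) ** 2 + (y - yi) ** 2) / (2.0 * h * h))
        for xi, yi in points
    )

def grid_integral(points, h, probe, n=100, pad=5.0):
    """Fraction of the rasterized density lying in cells whose density is
    below the density at the probe point."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs) - pad * h, max(xs) + pad * h
    y0, y1 = min(ys) - pad * h, max(ys) + pad * h
    dx, dy = (x1 - x0) / n, (y1 - y0) / n
    # Density at each cell centre of an n-by-n raster.
    grid = [
        kde_value(points, h, x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)
        for i in range(n)
        for j in range(n)
    ]
    total = sum(grid)                      # normalization factor
    threshold = kde_value(points, h, probe[0], probe[1])
    return sum(z for z in grid if z < threshold) / total

# Hypothetical toy data, for illustration only.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)]
print(grid_integral(points, 0.7, (1.0, 1.0)))
```

As with the R version, the accuracy depends on the raster resolution, since cells straddling the threshold contour are counted as wholly inside or wholly outside.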

Solved – Kernel density estimation bandwidth selection

There is no reason to choose the optimal bandwidth because, as you said, your data do not meet the assumptions under which it is derived.

This is a parameter selection problem, and it is very easy to overfit with KDE and end up with a lot of peaks. If you are getting multiple peaks, chances are that you are modeling your dataset rather than the true distribution. If you really want to use KDE, choose a larger bandwidth so that you get 2 peaks, with no need for post-processing.

But apparently you have some prior knowledge about your distribution, and you should use it. If your distribution has 2 peaks that are both due to 2 Gaussians, then I suggest you use MLE for a mixture of Gaussians; it will probably give you better results.
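A minimal sketch of that suggestion, assuming one-dimensional data and exactly two Gaussian components, is the EM algorithm below. The initialization, iteration count and synthetic data are arbitrary illustrative choices, and a library implementation would normally be preferable in practice.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(data, n_iter=200):
    """EM for a two-component 1D Gaussian mixture; returns (weights, means, sds)."""
    lo, hi = min(data), max(data)
    spread = (hi - lo) / 4.0 or 1.0
    means = [lo + spread, hi - spread]     # crude initial guesses
    sds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:               # guard against numerical underflow
                resp.append((0.5, 0.5))
            else:
                resp.append((p[0] / total, p[1] / total))
        # M-step: weighted maximum-likelihood updates.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds

# Synthetic two-peak data, for illustration only.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(150)] + [rng.gauss(6.0, 0.5) for _ in range(100)]
weights, means, sds = fit_two_gaussians(data)
print(weights, means, sds)
```

Unlike a KDE, this fit has exactly two peaks by construction, so no bandwidth needs to be tuned to suppress spurious modes.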

PS: I would have commented rather than posting an answer, but I can't ...
