Solved – What does “PDF overlap” mean? “To see whether a probability density function overlaps”

density functionlikelihoodterminology

I ran across the following sentence in a journal:
"To see whether a probability density function overlaps"

What does this word mean in the statistics literature, "overlaps"?

Best Answer

As I would interpret it, suppose you have to pdfs, $f_X$ and $f_Y$. The overlap would be the shared area under the curve. Mathematically, you could write this as:

$\displaystyle \int_{-\infty}^{\infty} \min(f_X(t), f_Y(t)) dt$

To help illustrate this, consider the following plot:heavyOverlap

In this image, we have two pdf's, one in blue and one in red. The area shaded in green is the overlap between the two. As you can see, there is fair amount of overlap between the two distributions.

In contrast, consider this scenario where we have two much more distinct pdf's:

lightOverlap

Now there is very little overlap. Why would you care about overlap?

It depends on the situation, but a common reason overlap is important is that when two distributions have very little overlap, observing the value of a random sample will be very informative about which distribution the sample came from, where as if there is heavy overlap, it is often quite uninformative. Consider the first plot: suppose we observed $x = -0.5$. In this case, it was more likely to have come from the blue distribution (that is, assuming it had equal probability of coming from either distribution), but it would not be too uncommon a value from the red distribution. So you can't be that certain that the sample came from the blue distribution. On the other hand, in plot two, this would have been extremely unusual from the red distribution, so you could be very certain that it came from the blue distribution.

This concept can be very important in problems such as classification (if there is little overlap between the categories, you have can high classification accuracy) or the EM algorithm for mixture models (if there is heavy overlap, the algorithm can be very slow because the probability that an observation came from a certain source is very dependent on the probability of that source, rather than the observed value).

Related Question