Probability – Calculating the Probability of an Object Belonging to a Group of Objects

Tags: maximum-likelihood, probability

Say I have a data set composed of $N$ objects. Each object has a given number of measured values attached to it (three in this case):

$x_1[a_1, b_1, c_1], x_2[a_2, b_2, c_2], …, x_N[a_N, b_N, c_N]$

meaning I have measured the properties $[a_i, b_i, c_i]$ for each $x$ object. The measurement space is thus the space determined by the variables $[a, b, c]$, where each $x$ object is represented by a point in it.

More graphically: I have $N$ objects scattered in a 3D space.

What I need is a way to determine the probability (or likelihood? Is there a difference?) that a new object $y[a_y, b_y, c_y]$ belongs to this cloud of objects. This probability, calculated for any $x$ object, will of course be very close to 1.

Is this feasible?


Add 1

To address AdamO's question: the object $y$ belongs to a set composed of $M$ mixed objects (with $M > N$). This means that some objects in this set will have a high probability of belonging to the first data set ($N$) and others will have a lower probability. I'm actually interested in these low-probability objects.

I can also come up with up to three more data sets of $N1$, $N2$, and $N3$ objects, all of them having the same global properties as those in the $N$ data set. I.e., an object in $M$ that has a low probability in $N$ will also have low probabilities when compared with $N1$, $N2$, and $N3$ (and vice versa: objects in $M$ with high probabilities of belonging to $N$ will also display high probabilities of belonging to $N1$, $N2$, and $N3$).


Add 2

According to the answer given in the question Interpretation/use of kernel density, I cannot derive the probability of a new object belonging to the set that generated the KDE/PDF (assuming I could even solve the equation for a non-unimodal PDF), because I have to make the a priori assumption that the new object was generated by the same process that generated the data set from which I obtained the KDE. Could someone confirm this, please?

Best Answer

The first question to address is: what's your distance metric? If you're comfortable with Euclidean space, by all means use that. However, you may want to transform these data onto an orthogonal basis using some kind of SVD; that can be done easily with any statistical software.
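As a minimal sketch of that SVD step (in Python with NumPy rather than a statistics package, and with made-up correlated 3-D measurements standing in for the $[a, b, c]$ data):

```python
import numpy as np

# Hypothetical stand-in for the N objects measured in (a, b, c) space.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[2.0, 0.8, 0.3], [0.8, 1.0, 0.2], [0.3, 0.2, 0.5]],
    size=500,
)

# Center the data, then use the SVD of the data matrix to rotate onto
# an orthogonal basis (this is just PCA). Dividing each coordinate by
# its singular value additionally whitens the data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T / (s / np.sqrt(len(X) - 1))

# The transformed coordinates are uncorrelated with unit variance,
# so Euclidean distance in Z-space is a reasonable metric.
print(np.round(np.cov(Z.T), 2))
```

Whether you whiten (divide by the singular values) or only rotate depends on whether the scales of $a$, $b$, $c$ are comparable.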

Given that these data have been transformed into a suitable domain, you can estimate a probability density for them using some kind of parametric or nonparametric estimation. Roughly normal data are amenable to estimation via maximum likelihood, but density smoothers like the boxcar or, better, a radial basis kernel smoother will give you an estimate of the probability density ($\hat{f}$) over your domain ($\Omega$).
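As one concrete (Python) sketch of the nonparametric route: SciPy's `gaussian_kde` is a Gaussian radial basis kernel smoother and yields an $\hat{f}$ you can evaluate at any new point. The data here are simulated stand-ins:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated stand-in for the N = 500 objects in (a, b, c) space.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))

# gaussian_kde expects variables in rows and observations in columns.
kde = gaussian_kde(X.T)

# \hat{f} evaluated at a new object y = (a_y, b_y, c_y).
y = np.array([0.2, -0.1, 0.4])
density_at_y = kde(y)[0]
print(density_at_y)
```

Note that `density_at_y` is a density, not yet a probability; turning it into a probability-like statement is the next step.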

With these in place, we can evaluate a new observation in terms of the probability of its having originated from that distribution. With a new observation taking values $x, y, z$, integrate the probability density over the values of the support for which the density is less than the one you observed. This is well behaved for unimodal distributions. That is,

$$\mathcal{F}(x, y, z) = \iiint_{r, s, t \,:\, \hat{f}(r, s, t) < \hat{f}(x, y, z)} \hat{f}(r,s,t)\,dr\,ds\,dt $$

This has a direct interpretation like a p-value (very roughly, and blending Bayesian/frequentist ideas): assuming this point was generated from a known distribution ($\hat{f}$), what's the probability of observing another point as improbable or more improbable, given that it comes from this distribution? If this value is sufficiently small, we would rule that the point is unlikely to have originated from the same distribution, though there is a chance this results in a Type I error.
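That triple integral rarely has a closed form for a KDE, but it can be estimated by Monte Carlo: draw points from $\hat{f}$ itself and count the fraction whose density falls at or below the density at the new observation. A sketch in Python, continuing the simulated-data setup above (`tail_mass` is a hypothetical helper name, not a library function):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated stand-in for the reference data set.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
kde = gaussian_kde(X.T)

def tail_mass(kde, point, n_draws=20000, seed=3):
    """Monte Carlo estimate of the integral of f over the region where
    f is at or below f(point), i.e. P(f(W) <= f(point)) for W ~ f."""
    draws = kde.resample(n_draws, seed=seed)  # draws has shape (3, n_draws)
    return float(np.mean(kde(draws) <= kde(point)))

# A point near the centre of the cloud gets a value near 1 ...
p_center = tail_mass(kde, np.zeros(3))
# ... while a far-away point gets a value near 0.
p_outlier = tail_mass(kde, np.array([5.0, 5.0, 5.0]))
print(p_center, p_outlier)
```

Objects in the questioner's $M$ set with small `tail_mass` against the $N$ data set would be the interesting low-probability ones.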

curve(dnorm, from=-5, to=5)                       # standard normal density
points(x=-1.8, y=dnorm(-1.8), col='red', pch=20)  # the new observation

# Shade the left tail, where dnorm(t) < dnorm(-1.8)
polygon(
  x=c(-5, seq(-5, -1.8, by=.1), -1.8),
  y=c(0, dnorm(seq(-5, -1.8, by=.1)), 0),
  col='black'
)

# Shade the matching right tail
polygon(
  x=c(1.8, seq(1.8, 5, by=.1), 5),
  y=c(0, dnorm(seq(1.8, 5, by=.1)), 0),
  col='black'
)

# Annotate with the two-tailed probability
text(-1.8, dnorm(-1.8),
     paste("p(this observation | dist'n holds)\n=", round(pnorm(-1.8)*2, 2)),
     pos=2)

[Figure: standard normal density with the two tail regions beyond $\pm 1.8$ shaded in black.]
