Max Likelihood Estimation – How to Perform Maximum Likelihood Fitting of Truncated, Mixed, Two Population Systems

maximum-likelihood python

This is a repost from math.stackexchange (I didn't know how to migrate the post; apologies). I was informed that this would be a better place to ask.


TL;DR: I am trying to do maximum likelihood fitting of a dataset containing two mixed populations, each observed over only a subset of its parameter space, to two pdfs. I include working code with Gaussian examples.


My question (and example code) is also long, so I will host some information in pastebin. I apologize, I am neither a mathematician nor a computer scientist, so some things may be ugly or inefficient.

I have a set of data which contains samples from two populations (we have a good prior on what percentage of the data belongs to each population). Neither sample probes the full spatial extent of the data, but the probability density function (pdf) is presumed to be smooth. I use a wide variety of pdfs, but for this discussion I will use only Gaussians for simplicity. Note, however, that I manually normalize the Gaussians, since my actual pdfs are not normalized.

We can think of the data as being sampled from two different Gaussians, with some kind of cut off boundaries, beyond which the population continues, but the sampling ends. The boundaries of the two populations intersect for a range.


The metacode would be something along the lines of:
A) Make a dataset by sampling two pdfs with known parameters, within two different given ranges.

$$ pdf = gaussian = e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} $$
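Step A can be sketched as follows. This is a minimal sketch, not the original code: the means, widths, observing windows, sample sizes, and the function name `sample_truncated_gaussian` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_gaussian(mu, sigma, lo, hi, n):
    """Draw n samples from N(mu, sigma) by rejection, keeping only [lo, hi]."""
    out = []
    while len(out) < n:
        draw = rng.normal(mu, sigma, size=n)
        out.extend(draw[(draw >= lo) & (draw <= hi)])
    return np.array(out[:n])

# Two populations of different sizes, observed over different (overlapping) windows
pop1 = sample_truncated_gaussian(mu=0.0, sigma=2.0, lo=-2.0, hi=6.0, n=700)
pop2 = sample_truncated_gaussian(mu=7.0, sigma=2.0, lo=2.0, hi=12.0, n=300)
data = np.concatenate([pop1, pop2])
```

Rejection sampling is the simplest way to truncate; for heavy truncation, inverse-CDF sampling over the window would be more efficient.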

B) Guess some parameters

–1) Find the normalizing constants ($n_{1}$ and $n_{2}$) for these parameters by numerically integrating the Gaussians of this guess over the observed space:
$$ n = \sum volume \cdot density $$
$$ density = pdf(x), \qquad volume = \text{step size} $$

–2) For each item in the dataset:
$$ Probability (p)= \frac{ratio}{n_{1}}*pdf_{1}(item) + \frac{1-ratio}{n_{2}}*pdf_{2}(item) $$

–3) Sum the log likelihoods for all data to find the total likelihood of the guess
$$ L = \sum{log(p)} $$
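Steps B1 through B3 together can be sketched like this. The function names and toy parameters are my own, and the normalizer is a plain Riemann sum, as an example of handling pdfs with no known closed-form integral:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian(x, mu, sigma):
    # Unnormalized Gaussian from step A
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def norm_constant(mu, sigma, lo, hi, n_steps=10000):
    # Step B1: Riemann-sum normalizer over the observed range
    x = np.linspace(lo, hi, n_steps)
    return np.sum(gaussian(x, mu, sigma)) * (x[1] - x[0])

def log_likelihood(data, ratio, mu1, s1, mu2, s2, range1, range2):
    # Steps B2-B3: mixture probability per point, then summed logs
    n1 = norm_constant(mu1, s1, *range1)
    n2 = norm_constant(mu2, s2, *range2)
    p = (ratio / n1) * gaussian(data, mu1, s1) \
        + ((1.0 - ratio) / n2) * gaussian(data, mu2, s2)
    return np.sum(np.log(p))

# Fully sampled toy data: the true parameters should score higher than a bad guess
data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
ll_true = log_likelihood(data, 0.5, 0.0, 1.0, 5.0, 1.0, (-5, 5), (0, 10))
ll_bad = log_likelihood(data, 0.5, 2.0, 1.0, 3.0, 1.0, (-5, 5), (0, 10))
```

Note this version evaluates the untruncated Gaussians at every data point; this is exactly the part that the answer below corrects for the truncated case.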

C) Use some likelihood landscape explorer to converge on a best answer (here I use brute grid for simplicity)
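For step C, a brute grid over one parameter of a single Gaussian illustrates the idea. This is a toy sketch of my own (the real problem scans several parameters at once):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, 1000)

def log_lik(mu):
    # Log-likelihood of a unit-variance Gaussian with mean mu
    # (additive constants dropped; they do not affect the argmax)
    return -0.5 * np.sum((data - mu) ** 2)

grid = np.linspace(0.0, 6.0, 61)   # brute grid over mu, step 0.1
best_mu = grid[np.argmax([log_lik(m) for m in grid])]
```

With several parameters the grid grows combinatorially, which is why the brute grid is only used here for simplicity.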


I have had great success disentangling two populations if they are sampled in their entirety, but poor success disentangling two populations that have been truncated (the answers are frequently nonsensical and boundary dominated).

The code (~100 lines, but not dense) will run as is. It follows the metacode above and successfully recovers the parameters of the two Gaussians when they are sampled over (essentially) their entire area. However, changing lines 31 and 32 to, for example:

    range1 = [-2, 6]
    range2 = [2, 12]

causes the algorithm to converge on incorrect values. My question is why.

Best Answer

I've been working on this for a while now, and the (obvious) answer popped into my head somewhere in the middle of a BSG episode, so I return to drop it off in the archive for anyone with future interest.

My issue was that my calculation of the probability density after the normalizations was incorrect: it did not account for the observing ranges. Step B2 in the metacode above should read:

$$Probability(p) = \frac{ratio}{n_{1}}*pdf_{1}(item) + \frac{1-ratio}{n_{2}}*pdf_{2}(item)$$
$$pdf_{1}(item) = 0 \quad \text{if} \quad item \notin range_{1}$$
$$pdf_{2}(item) = 0 \quad \text{if} \quad item \notin range_{2}$$

It seems obvious in retrospect.
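A minimal sketch of the corrected B2 step, with an explicit indicator for each observing window. The names (`truncated_pdf`, `mixture_prob`) and example parameters are mine, not from the posted code:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

def truncated_pdf(x, mu, sigma, lo, hi):
    # The fix: the pdf is zero wherever the item falls outside [lo, hi]
    inside = (x >= lo) & (x <= hi)
    return np.where(inside, gaussian(x, mu, sigma), 0.0)

def norm_constant(mu, sigma, lo, hi, n_steps=10000):
    # Riemann-sum normalizer over the observed window only
    x = np.linspace(lo, hi, n_steps)
    return np.sum(gaussian(x, mu, sigma)) * (x[1] - x[0])

def mixture_prob(x, ratio, mu1, s1, mu2, s2, range1, range2):
    # Corrected B2: each component contributes only inside its own window
    n1 = norm_constant(mu1, s1, *range1)
    n2 = norm_constant(mu2, s2, *range2)
    return (ratio / n1) * truncated_pdf(x, mu1, s1, *range1) \
         + ((1.0 - ratio) / n2) * truncated_pdf(x, mu2, s2, *range2)

# Points outside both windows now (correctly) contribute zero probability
p = mixture_prob(np.array([-3.0, 0.0, 8.0, 13.0]),
                 0.5, 0.0, 2.0, 7.0, 2.0, (-2, 6), (2, 12))
```

With this change, data near a window boundary no longer leak probability from the component that cannot have produced them, which is what made the earlier fits boundary dominated.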

I post new code which will perform the task proposed. Two data sets of different sizes are produced from differing input Gaussian parameters; they are truncated so that each is observed only over a certain (different) range, and then combined. Guesses of these parameters are generated; the Gaussians of these guesses are normalized over the observed parameter space manually (as an example for unknown normalizer functions); and the likelihood that each guess is the model from which the data are drawn is calculated by adding the log likelihoods of the *truncated* pdf.
