Function proportional to the log likelihood for the Gaussian distribution

Tags: log-likelihood, normal-distribution, probability, statistical-inference, statistics

The following question has been cross-posted to CrossValidated, upon recommendation from the community, due to a lack of responses here.


Consider the following problem from a course on statistical inference:

If we generate a sample $x_i$ for $i \in \{1, \dots, n\}$ from $p(x_i) = \sum_{k=1}^2 w_k\, p(x_i \mid \mu_k, \sigma^2_k)$,
where the $p(x_i \mid \mu_k, \sigma^2_k)$ are Gaussian densities and $w_2 = 1 - w_1$ (and all the parameters are assumed unknown):

Find the log-likelihood function for the sample $x_{1:n}$.

Since the samples are independent, the joint density factors, and the log-likelihood is:

$$ l( \theta \mid x_{1:n}) = \log p(x_{1:n}) = \log \Big( \prod_{i=1}^n p(x_i) \Big) = \sum_{i=1}^n \log p(x_i) $$

Now substituting in for $p(x_i)$ we get:

$$ l( \theta \mid x_{1:n}) = \sum_{i=1}^n \log \Big( \sum_{k=1}^2 w_k\, p(x_i \mid \mu_k , \sigma^2_k) \Big) $$

We can now substitute in the Gaussian densities which gives us:

$$ l( \theta \mid x_{1:n}) = \sum_{i=1}^n \log \Big[ w_1 (2 \pi \sigma^2_1)^{-1/2} \exp \Big( -\frac{(x_i - \mu_1)^2}{2 \sigma^2_1} \Big) + w_2 (2 \pi \sigma^2_2)^{-1/2} \exp \Big( -\frac{(x_i - \mu_2)^2}{2 \sigma^2_2} \Big) \Big] $$
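As a sanity check on this expression, here is a minimal numerical sketch in Python; the function name `mixture_log_likelihood`, the data, and the parameter values are made up purely for illustration:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_log_likelihood(x, w1, mu1, sigma1, mu2, sigma2):
    """l(theta | x_{1:n}) for a two-component Gaussian mixture.

    For each x_i this evaluates log(w1*N(x_i; mu1, sigma1^2) + w2*N(x_i; mu2, sigma2^2))
    via logsumexp for numerical stability, then sums over i.
    """
    w2 = 1.0 - w1
    log_terms = np.stack([
        np.log(w1) + norm.logpdf(x, loc=mu1, scale=sigma1),
        np.log(w2) + norm.logpdf(x, loc=mu2, scale=sigma2),
    ])
    return logsumexp(log_terms, axis=0).sum()

# Illustrative data and parameter values only.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.5, 1.0, 60), rng.normal(1.5, 0.5, 40)])
print(mixture_log_likelihood(x, w1=0.6, mu1=-1.5, sigma1=1.0, mu2=1.5, sigma2=0.5))
```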


The following solution aligns with my working; however, the proportionality relation on its last line is unclear to me.

[Image of the course solution (formula on a grey background); its final line asserts that the log-likelihood is proportional to a double summation over $i$ and $k$, with the summation over $k$ taken outside the logarithm.]

Why does the final line hold? I can't see how to get from the expression above, where the Gaussian densities have been substituted into the sum, to something proportional to the given double summation.

I understand that we can drop multiplicative constants; however, it seems as though the summation has been taken out of the logarithm at some stage in order to derive the proportionality relation. I'm not sure whether this step is valid and, if it is, why it holds.

Best Answer

I am quite certain that the formula on the grey background is false. It would mean that the log-likelihood of a single sample $x$ (i.e. the log of the density of a mixture of two Gaussians) is a polynomial in $x$ of degree two, i.e. that a mixture of two Gaussians is itself Gaussian. This is obviously false; see https://www.wolframalpha.com/input?i=plot+ln%28%281%2F2%29+e%5E%28-%28x%2B3%2F2%29%5E2%29+%2B+%281%2F2%29+e%5E%28-%28x-3%2F2%29%5E2%29%29+for+x%3D-4..4
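If you prefer a numerical check to the plot, here is a small sketch using the same mixture as in the linked plot; a best-fitting quadratic clearly fails to reproduce its log-density:

```python
import numpy as np

# Same mixture as in the WolframAlpha link: 0.5*exp(-(x+1.5)^2) + 0.5*exp(-(x-1.5)^2).
x = np.linspace(-4, 4, 401)
log_mix = np.log(0.5 * np.exp(-(x + 1.5) ** 2) + 0.5 * np.exp(-(x - 1.5) ** 2))

# If log_mix were a degree-2 polynomial in x, a quadratic fit would be exact.
coeffs = np.polyfit(x, log_mix, deg=2)
max_err = np.max(np.abs(np.polyval(coeffs, x) - log_mix))
print(max_err)  # far from zero, so the log of the mixture is not quadratic in x
```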

Note that the Expectation-Maximization (EM) algorithm (which is where the formula on the grey background seems to come from, as is somewhat visible in the CrossValidated post) avoids this, and keeps simple formulae, by first estimating which Gaussian each sample came from, and then computing the log-likelihood of the sample together with the index of the Gaussian it was sampled from. Let us try to find out what this means.

To find a formula similar to what you are looking for, first define $z_1, \dots, z_n$ to be the indices of the Gaussians from which $x_1, \dots, x_n$ were sampled. Note that the $z_i$ are not present in the data. The log-likelihood of a joint observation $(z, x)$ is \begin{align} \mathcal L\mathcal L(z, x) := \ln w_{z} + \ln p(x \mid \mu_z, \sigma^2_z) = \ln w_z - \ln\big(\sqrt{2\pi\sigma_z^2}\big) - \frac{(x-\mu_z)^2}{2\sigma_z^2} . \qquad (1) \end{align} On the other hand, $$ \mathbb P(z_1 = k \mid x_1) = \frac{w_k\, p(x_1 \mid \mu_k, \sigma_k^2)}{\sum_{k'} w_{k'}\, p(x_1 \mid \mu_{k'}, \sigma_{k'}^2)} . $$ You are given $x_1, \dots, x_n$ generated according to an unknown mixture of Gaussians. If, somehow, you knew $\mathbb P(z_1 = k \mid x_1)$ exactly, then you could generate the $z_i$ to obtain a sample $(z_i, x_i)$ that has the same distribution as the original data. Unfortunately, $\mathbb P(z_1 = k \mid x_1)$ is unknown because the $\mu_k, \sigma_k^2$ (and the $w_k$) are unknown.
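For concreteness, here is a minimal sketch (in Python) of computing these posterior probabilities $\mathbb P(z_i = k \mid x_i)$ from some current parameter estimates; the function name `responsibilities` and the numerical values are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, w, mu, sigma):
    """P(z_i = k | x_i) for each sample x_i and each component k.

    x has shape (n,); w, mu, sigma have shape (2,) and hold the current
    estimates of the mixture weights, means and standard deviations.
    Each row of the returned (n, 2) array sums to 1.
    """
    # Unnormalized posterior: w_k * p(x_i | mu_k, sigma_k^2)
    joint = w * norm.pdf(x[:, None], loc=mu, scale=sigma)
    return joint / joint.sum(axis=1, keepdims=True)

# Illustrative use with made-up estimates.
x = np.array([-2.0, 0.1, 1.8])
print(responsibilities(x, w=np.array([0.5, 0.5]),
                       mu=np.array([-1.5, 1.5]),
                       sigma=np.array([1.0, 0.5])))
```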

Here comes the trick: the EM algorithm assumes you already have pretty good estimates of the $w_k, \mu_k, \sigma_k^2$, so you have a pretty good estimate of $\mathbb P(z_1 = k \mid x_1)$. You can therefore generate the $z_i$ in this manner and then compute the log-likelihood of the joint sample $(z_i, x_i)$, which, by $(1)$, has a very simple expression.

In fact, sampling the $z_i$ is not needed: you can compute the expected log-likelihood you would obtain in this manner (conditionally on the $x_i$), which is given by $$ \sum_{i=1}^n \sum_{k=1}^2 \mathcal L\mathcal L(k, x_i)\, \mathbb P(z_i = k \mid x_i) . $$ This is where every formula with the sum outside the log ([2] in the CrossValidated post) comes from, and as you can see, such formulas do not give the true log-likelihood.
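In code, this expected complete-data log-likelihood (the quantity EM usually calls the Q function) could look like the following sketch (Python, hypothetical current estimates); note that it is built from $(1)$ and the posterior probabilities above, and that it is not the true log-likelihood $\sum_i \log \sum_k w_k\, p(x_i \mid \mu_k, \sigma_k^2)$:

```python
import numpy as np
from scipy.stats import norm

def expected_complete_loglik(x, w, mu, sigma):
    """sum_i sum_k P(z_i = k | x_i) * [ln w_k + ln p(x_i | mu_k, sigma_k^2)].

    This is the expectation of (1) over the z_i, conditionally on the x_i,
    under the current estimates w, mu, sigma (each of shape (2,)).
    """
    log_joint = np.log(w) + norm.logpdf(x[:, None], loc=mu, scale=sigma)  # LL(k, x_i)
    joint = np.exp(log_joint)                                             # w_k * p(x_i | ...)
    resp = joint / joint.sum(axis=1, keepdims=True)                       # P(z_i = k | x_i)
    return np.sum(resp * log_joint)
```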

Note 1: There is a slight difference in [2], where $\ln(\sqrt{2\pi\sigma_z^2})$ is replaced by $\frac{1}{2}\ln(\sigma_z^2)$. This is okay because we do not care about additive constants.

Note 2: To be convinced of just how sketchy our reasoning was, be aware that the EM algorithm, which we derived here as an attempt to find the maximum likelihood estimator (MLE) via a couple of heuristics, might fail to converge to the MLE; see these lecture notes.
