What is the meaning of joint maximization of the likelihood function?

machine-learning, maximum-likelihood, normal-distribution, optimization

I am reading the book Pattern Recognition and Machine Learning by Christopher M. Bishop.

I am currently in section 1.2.4 Gaussian Distribution.

The normal distribution is given by:

$$ \mathcal{N}(x\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \hspace{1cm} (1.1)$$

and the likelihood of $x$ is obtained by simply plugging $x$ into the normal density. Then we assume that we have a data set of $N$ observations whose mean and variance are both unknown. Since the observations are independent and identically distributed, their likelihood function is given by:

$$ p(\textbf{x} \mid \mu,\sigma^2) = \prod_{n=1}^N \mathcal{N}(x_n\mid\mu,\sigma^2) \hspace{1cm}(1.2)$$
where $\textbf{x} = (x_1, \ldots, x_N)$ is the data set.

Basically, for a given set of data we are trying to maximize the likelihood function by finding the optimal values of the mean and variance.

This is all clear so far. However, a little later the author says that we perform a joint maximization of (1.2) with respect to the mean and the variance in order to find their values. I just don't understand what he means by joint maximization; searching for the term online gave completely different results.

My guess is that he means we take the gradient of (1.2) with respect to the mean and the variance and solve for the point where the gradient vector equals zero, and that we can simplify the procedure by first taking the log of the function.
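Concretely, if my guess is right, taking the log of (1.2) turns the product into a sum,

$$\ln p(\textbf{x}\mid\mu,\sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi),$$

and setting the partial derivatives with respect to $\mu$ and $\sigma^2$ to zero would give

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2,$$

which matches the sample mean and the (biased) sample variance that Bishop arrives at a few lines later.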

Best Answer

"Joint maximization" simply means that you are searching for the choice of joint parameters $\mu$ and $\sigma^2$ for which the likelihood attains a global maximum.

In multivariable calculus, for instance, suppose you have the function $$f(x,y) = \frac{1}{x^2 + y^2 + 1}.$$ You could maximize $f$ with respect to $x$ only, or with respect to $y$ only; or you could maximize $f$ with respect to both $x$ and $y$ jointly, in which case you would find that the global maximum of $1$ is attained at $(x,y) = (0,0)$. If instead you maximized with respect to $x$ only, you would get $f(0,y) = 1/(y^2 + 1)$, which still depends on $y$ and is not in general the global maximum: if $y = 1$, for instance, then $f(0,1) = 1/2$ is merely the maximum subject to that constraint.
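If it helps to see the distinction numerically, here is a minimal sketch (assuming NumPy and SciPy are available; maximization is done by minimizing the negated function):

```python
import numpy as np
from scipy.optimize import minimize

def f(x, y):
    return 1.0 / (x**2 + y**2 + 1.0)

# Joint maximization: search over (x, y) simultaneously.
joint = minimize(lambda v: -f(v[0], v[1]), x0=np.array([2.0, 2.0]))
print(joint.x, -joint.fun)  # ~ [0, 0], 1.0  (the global maximum)

# Maximization with respect to x only, with y held fixed at 1:
# the best we can do is f(0, 1) = 1/2, not the global maximum of 1.
partial = minimize(lambda v: -f(v[0], 1.0), x0=np.array([2.0]))
print(partial.x, -partial.fun)  # ~ [0], 0.5
```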

I would also like to remark that the author's use of notation is sloppy and probably an indication of low quality writing. This sloppiness tends to be more common among those who approach statistics through applications as opposed to theory. For instance, the use of the notation $N(x \mid \mu, \sigma^2)$ is sloppy: this is a density and should be written as $f_X(x)$ or $f_X(x \mid \mu, \sigma^2)$, or if $X$ is understood, it may be written $f(x)$ or $f(x \mid \mu, \sigma^2)$. $N$ is not appropriate and conflates density functions with probability distributions.

Another example is how the likelihood is characterized. The proper notation is to write something like $p(\mu, \sigma^2 \mid \boldsymbol x)$ or $\mathcal L(\mu, \sigma^2 \mid \boldsymbol x)$. This is because the likelihood is regarded as a function of the parameter(s) (in this case, $\mu$ and $\sigma^2$), given a sample (in this case, $\boldsymbol x$).
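To tie this back to the Gaussian problem in the question: the sketch below (again assuming NumPy and SciPy, with a synthetic i.i.d. sample) maximizes $\mathcal L(\mu, \sigma^2 \mid \boldsymbol x)$ jointly over both parameters by numerically minimizing the negative log-likelihood, and compares the result with the closed-form maximum likelihood estimates:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic i.i.d. sample

def neg_log_likelihood(params, data):
    # Optimize over log(sigma^2) so the variance stays positive.
    mu, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)
    n = data.size
    return 0.5 * np.sum((data - mu) ** 2) / sigma2 + 0.5 * n * np.log(2 * np.pi * sigma2)

# Joint maximization: search over (mu, log sigma^2) simultaneously.
res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

# Closed-form maximum likelihood estimates for comparison.
print(mu_hat, x.mean())     # both near 3.0
print(sigma2_hat, x.var())  # both near 4.0 (np.var uses the 1/N normalization)
```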
