Maximum Likelihood Estimator – Meaning of Invariance

Tags: estimators, invariance, mathematical-statistics, maximum-likelihood, profile-likelihood

In Casella and Berger, the invariance of the MLE is stated as:

If $\hat{\theta}$ is the MLE of $\theta$, then for any function
$\tau$, $\tau(\hat{\theta})$ is the MLE of $\tau(\theta)$.

In the case of a one-to-one transformation, everything is clear.

But what is the point when $\tau$ is not one-to-one? In C&B there is even a proof of this fact.
However, to prove it the authors define a NEW LIKELIHOOD, the "induced likelihood function" (essentially a profile likelihood), defined as

$$\mathcal{L}^*(\eta|x) = \sup_{\{\theta: \tau(\theta) = \eta\}}\mathcal{L}(\theta|x)$$

Sure, we can prove that $\tau(\hat{\theta})$ maximizes the profile likelihood above. But this is NOT THE SAME LIKELIHOOD as in the original problem!

Example: suppose $f(x|\theta) = \text{N}(\theta, 1)$ and the transformation is $\tau(\theta) = \theta^2$. OK, $\tau(\hat{\theta})$ maximises the profile likelihood above. Now what?
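
For concreteness, here is a small numerical sketch of what I mean (the sample below is made up); it just confirms that $\hat{\theta}^2$ maximises the induced likelihood, which is the fact whose interpretation I am asking about.

```python
import numpy as np

# Illustrative data: a few draws nominally from N(theta, 1); the values are made up.
x = np.array([1.3, 0.4, 2.1, 1.7, 0.9])

def log_lik(theta, x):
    """N(theta, 1) log-likelihood, up to an additive constant."""
    return -0.5 * np.sum((x - theta) ** 2)

theta_hat = x.mean()  # MLE of theta

def induced_log_lik(eta, x):
    """log L*(eta|x) = sup over {theta : theta^2 = eta} of log L(theta|x)."""
    root = np.sqrt(eta)
    return max(log_lik(root, x), log_lik(-root, x))

# Check that eta_hat = theta_hat^2 maximises the induced likelihood.
eta_grid = np.linspace(0.0, 9.0, 2001)
eta_star = eta_grid[np.argmax([induced_log_lik(e, x) for e in eta_grid])]
print(theta_hat ** 2, eta_star)  # agree up to grid resolution
```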

I guess I would summarize my question:

  1. What is the point of the whole theorem in the case of non-bijective functions $\tau$?

  2. Assuming $\eta$ maximizes the profile likelihood, what does that tell me about the original likelihood? E.g. in the example above, I cannot convert $\eta = \theta^2$ back into $\theta$.

Best Answer

Casella and Berger (2002) explain this by saying that, "In many cases, this simple version of the invariance of MLEs is not useful because many of the functions we are interested in are not one-to-one" (p. 320). This is the essence of their motivation for extending the concept of the likelihood function. If you have an existing sampling density framed in terms of a minimal sufficient parameter, and you want an estimate of a non-injective function of that parameter (as in your example), then the latter quantity is not itself a valid parameter for describing the sampling density.

One of the examples they give (on p. 321) which is quite instructive, is the case where you have binomial data $X \sim \text{Bin}(n,p)$ and you want to make an inference about the standard deviation $\mathbb{S}(X) = \sqrt{np(1-p)}$. This is a valuable quantity of interest, but it is a non-injective function of the parameter $p$. (In particular, we get the same standard deviation for $p$ and $p' = 1-p$.) This is the kind of situation where it is useful to find the MLE for a non-injective function of a parameter.
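
Here is a rough numerical sketch of that binomial example (the data below are made up, not from C&B): by invariance the MLE of the standard deviation is just $\sqrt{n\hat{p}(1-\hat{p})}$ with $\hat{p} = x/n$, and maximising the induced likelihood over the two roots $p$ and $1-p$ gives the same value.

```python
import numpy as np
from scipy.stats import binom

# Illustrative binomial data (made up): x successes out of n trials.
n, x = 20, 7

p_hat = x / n                                  # MLE of p
sd_hat = np.sqrt(n * p_hat * (1 - p_hat))      # MLE of sqrt(n p (1-p)) by invariance

def induced_log_lik(s, n, x):
    """log L*(s|x) = sup over {p : sqrt(n p (1-p)) = s} of log L(p|x).
    Each attainable s corresponds to (at most) two roots, p and 1 - p."""
    disc = 1 - 4 * s**2 / n
    if disc < 0:
        return -np.inf                         # s not attainable for any p in (0, 1)
    p_lo = 0.5 * (1 - np.sqrt(disc))
    p_hi = 0.5 * (1 + np.sqrt(disc))
    return max(binom.logpmf(x, n, p_lo), binom.logpmf(x, n, p_hi))

s_grid = np.linspace(1e-6, np.sqrt(n) / 2, 2001)
s_star = s_grid[np.argmax([induced_log_lik(s, n, x) for s in s_grid])]
print(sd_hat, s_star)                          # agree up to grid resolution
```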


This is not so different from standard MLE analysis: The situation in your example is not actually any different from the case where we have a distribution that depends on two parameters, and we only want an MLE for one of those parameters. In fact, you can re-frame it directly in those terms. Taking your example, suppose we reframe the sampling density in terms of the two parameters $\phi \geqslant 0$ and $\lambda \in \{ -1, 1\}$ and we use the sampling density:

$$f(x|\phi, \lambda) = \text{N}(x|\phi \lambda, 1) = \frac{1}{\sqrt{2 \pi}} \exp \Big( -\frac{(x-\phi \lambda)^2}{2} \Big),$$

and the corresponding likelihood function:

$$L_x(\phi, \lambda) \propto \exp \Big( -\frac{(x-\phi \lambda)^2}{2} \Big).$$

This new representation is just as valid as the original representation in terms of $\theta = \phi \lambda$, and all that we have done is to split this parameter into two parameters that capture aspects of its behaviour. In the new form, the parameter $\tau = \theta^2 = \phi^2$ is a one-to-one function of our parameter $\phi$, so we can unambiguously re-frame our likelihood function as:

$$L_x(\tau, \lambda) \propto \exp \Big( -\frac{(x-\sqrt{\tau} \lambda)^2}{2} \Big).$$
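
As a quick numerical check of this reparameterised likelihood (the observed value below is made up), maximising $L_x(\tau, \lambda)$ jointly over a grid of $\tau$ values and the two signs $\lambda \in \{-1, 1\}$ recovers $\hat{\tau} = x^2$, which is exactly $\tau(\hat{\theta})$ as the invariance theorem promises.

```python
import numpy as np

x = 1.4                                   # a single observation (illustrative value)

def log_lik(tau, lam, x):
    """log L_x(tau, lambda) for the N(sqrt(tau)*lambda, 1) density, up to a constant."""
    return -0.5 * (x - np.sqrt(tau) * lam) ** 2

# Joint maximisation over (tau, lambda): tau on a grid, lambda in {-1, +1}.
tau_grid = np.linspace(0.0, 9.0, 3001)
best = max(
    ((log_lik(t, lam, x), t, lam) for t in tau_grid for lam in (-1, 1)),
    key=lambda triple: triple[0],
)
_, tau_hat, lam_hat = best
print(tau_hat, x ** 2)                    # tau_hat == x^2 up to grid resolution
print(lam_hat, np.sign(x))                # the sign parameter picks up sign(x)
```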

Consequently, the MLE for the parameter $\tau$ is $\hat{\tau}_\text{MLE} = \hat{\phi}_\text{MLE}^2$, where $(\hat{\phi}_\text{MLE}, \hat{\lambda}_\text{MLE})$ is any pair satisfying $L_x(\hat{\phi}_\text{MLE}, \hat{\lambda}_\text{MLE}) \geqslant L_x(\phi, \lambda)$ for all valid $\phi$ and $\lambda$. It can easily be shown that the maximised likelihood corresponds to:

$$L_x(\hat{\phi}_\text{MLE}, \hat{\lambda}_\text{MLE}) = \sup_{\lambda \in \{-1,1 \}} L_x(\hat{\phi}_\text{MLE}, \lambda) = L_x^*(\hat{\phi}_\text{MLE}),$$

using the "induced likelihood" defined in Casella and Berger. So, you can see that you are already using this technique when you look at the MLE of a parameter in a context where you have a model that depends on multiple parameters. The situation is not fundamentally different if you have data $X \sim \text{N}(\theta, \sigma^2)$ and you want the MLE for $\theta$, while treating $\sigma$ as a nuisance parameter. In each case the MLE of the parameter of interest maximises the "induced likelihood".