[Math] What are strategies for maximizing functions–specifically for finding the MLE–without using the derivative (statistics)

Tags: maximum-likelihood, probability, statistics

I know how to use the MLE; I just run into trouble when I can't maximize the likelihood function by taking its derivative.

As an example, I know that the MLE for a uniform distribution on $[0, B]$ is $\hat{B}=\max(X)$, where $X$ is the observed data, but I do not understand why.

Any advice?

Best Answer

The process of learning calculus has a tendency to ingrain the notion that computing global extrema is an exercise in differentiating smooth functions to locate critical points. However, this is only a special case of a much broader class of optimization problems, and the methods one can use do not always fit such a narrow approach.

If I gave you the following function: $$f(x) = \begin{cases} 1, & 0 < x \le 1/2 \\ 2, & 1/2 < x \le 3/4, \\ 0, & \text{otherwise}, \end{cases}$$ what would you say was the maximum value attainable for $f$? Would you set about answering such a question by taking the derivative, looking for critical points? Of course not; that is absurd.

In a similar vein, would you verify that $f$ is a legitimate density function by integrating $f$? Or would you simply calculate $1(1/2 - 0) + 2(3/4 - 1/2) = 1$?
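If you like, both claims can be checked by brute force. Here is a minimal sketch in Python (the language and the grid are my own choices, not part of the answer): it evaluates $f$ on a fine grid to find its maximum and adds up height times width of each piece to confirm the total probability is $1$.

```python
import numpy as np

def f(x):
    """The piecewise function above: 1 on (0, 1/2], 2 on (1/2, 3/4], 0 otherwise."""
    if 0 < x <= 0.5:
        return 1.0
    elif 0.5 < x <= 0.75:
        return 2.0
    return 0.0

# Maximum: evaluate on a fine grid and take the largest value -- no derivative needed.
xs = np.linspace(-0.5, 1.5, 200_001)
print("max of f:", max(f(x) for x in xs))                              # 2.0

# Total probability: height * width of each piece.
print("total probability:", 1.0 * (0.5 - 0.0) + 2.0 * (0.75 - 0.5))    # 1.0
```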

The tools of calculus are powerful, but they are not always the tools of first resort. You have to think first about what you are trying to achieve, and then choose the most appropriate mathematical method to achieve it. If you reflexively think "differentiation $\to$ critical points $\to$ extrema," you have effectively skipped this initial step.

Here is another example. Suppose a random variable $X$ has the following distribution: $$\Pr[X = 0] = \theta/6, \quad \Pr[X = 1] = \theta/3, \quad \Pr[X = 2] = 1 - \theta/2.$$ You can verify that for any $\theta \in [0,2]$, the above distribution is a legitimate probability mass function.
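If you want to carry out that verification numerically rather than by hand, a small sketch (again in Python; the helper name `pmf` is mine) confirms that the three probabilities are nonnegative and sum to one for every $\theta$ checked in $[0,2]$:

```python
import numpy as np

def pmf(theta):
    """Probabilities of X = 0, 1, 2 for a given theta."""
    return np.array([theta / 6, theta / 3, 1 - theta / 2])

for theta in np.linspace(0, 2, 9):
    p = pmf(theta)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
print("valid pmf for every theta checked in [0, 2]")
```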

Now what if I said, "given a single observation from this distribution, what is the MLE of $\theta$?" How would you go about answering such a question?

Well, let's construct the likelihood given a single observation. $$\mathcal L(\theta \mid x) = f(x \mid \theta) = \ldots?$$ Wait, this is weird. What do you write? Well, the probability mass function is $$\Pr[X = x] = \begin{cases} \theta/6, & x = 0 \\ \theta/3, & x = 1 \\ 1 - \theta/2, & x = 2. \end{cases}$$ It's piecewise. So, the likelihood is also piecewise. What this tells you is that $\mathcal L$ is three different functions of $\theta$, depending on the observed outcome $x$. Now the key is to ask, "for each of the three outcomes, what allowable value of $\theta$ can you choose that makes $\mathcal L$ as big as possible?"

For example, suppose we observed $x = 0$. Then $\mathcal L(\theta \mid x = 0) = \theta/6$. You might say, "but I can make this value as big as I want!" Indeed, if you differentiate it with respect to $\theta$, you will find no critical points. But remember, $\theta$ cannot exceed $2$: the reason is that $\Pr[X = 2] = 1 - \theta/2$, and this value, being a probability, cannot be less than $0$. So, in the case where we observed $x = 0$, the maximum likelihood estimate is $\hat \theta = 2$.

If you do the same for the other cases, you will find that for a single observation $x$, $$\hat \theta = \begin{cases} 2, & x \in \{0, 1\} \\ 0, & x = 2. \end{cases}$$ There are no other possibilities.
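To convince yourself of this table numerically, a brute-force grid evaluation reproduces it (a sketch in Python; the grid search is purely illustrative, not how one would prove the result):

```python
import numpy as np

def likelihood(theta, x):
    """L(theta | x) for a single observation x from the three-point distribution."""
    return {0: theta / 6, 1: theta / 3, 2: 1 - theta / 2}[x]

thetas = np.linspace(0, 2, 2001)   # the allowable parameter values
for x in (0, 1, 2):
    L = np.array([likelihood(t, x) for t in thetas])
    print(f"x = {x}: MLE on the grid = {thetas[L.argmax()]:.3f}")
# x = 0: MLE on the grid = 2.000
# x = 1: MLE on the grid = 2.000
# x = 2: MLE on the grid = 0.000
```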

Note we didn't use calculus at all. We just used basic mathematical reasoning, founded upon a precise understanding of the question at hand.

When it comes to the MLE for a parameter $\theta$ where $U \sim \operatorname{Uniform}(0,\theta)$, the idea is the same: for an IID sample $\boldsymbol u = (u_1, \ldots, u_n)$, we construct the joint density $$f(\boldsymbol u \mid \theta) = \begin{cases} \theta^{-n}, & u_1, \ldots, u_n \in [0, \theta] \\ 0, & \text{otherwise}. \end{cases}$$ We can do better, though: in the sample $(u_1, \ldots, u_n)$ there must be a smallest element and a largest element, which we will call $u_{(1)}$ and $u_{(n)}$, respectively. Then we can write $$\mathcal L(\theta \mid \boldsymbol u) = f(\boldsymbol u \mid \theta) = \begin{cases} \theta^{-n}, & 0 \le u_{(1)} \le u_{(n)} \le \theta, \\ 0, & \text{otherwise}. \end{cases}$$

And now what? If we regard the sample as fixed and $\theta$ as the unknown variable in the likelihood, then $\theta^{-n}$ is largest when $\theta$ is as small as permissible. But $\theta$ cannot be smaller than $u_{(n)}$, else the "otherwise" case kicks in and the likelihood is zero. And if $\theta$ is larger than $u_{(n)}$, we are suboptimal, because we are allowing $\theta$ to be larger than it needs to be, making $\mathcal L$ smaller than it needs to be. Therefore, $$\hat \theta = u_{(n)} = \max(u_1, \ldots, u_n),$$ the maximum observation in the sample.
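To see the same argument play out numerically, here is a short sketch (the "true" $\theta = 3$, the sample size, and the grid are arbitrary illustrative choices of mine): evaluate the piecewise likelihood on a grid of candidate $\theta$ values and compare the argmax with the sample maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0, 3, size=20)    # IID sample; the "true" theta = 3 is arbitrary

def likelihood(theta, u):
    """L(theta | u) for Uniform(0, theta): theta^(-n) if every u_i <= theta, else 0."""
    return theta ** (-len(u)) if np.all(u <= theta) else 0.0

thetas = np.linspace(0.01, 6, 60_000)
L = np.array([likelihood(t, u) for t in thetas])
print("argmax of the likelihood:", thetas[L.argmax()])
print("sample maximum u_(n):    ", u.max())
# The two agree up to the grid resolution: L is zero for theta < u_(n)
# and strictly decreasing for theta > u_(n).
```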