The process of learning calculus tends to ingrain the notion that computing global extrema is an exercise in differentiating smooth functions defined on the whole real line in order to locate critical points. This, however, is only a special case of a much broader class of optimization problems, and the methods one can use are not limited to such a narrow approach.
If I gave you the following function: $$f(x) = \begin{cases} 1, & 0 < x \le 1/2 \\ 2, & 1/2 < x \le 3/4 \\ 0, & \text{otherwise}, \end{cases}$$ what would you say is the maximum value attainable by $f$? Would you set about answering such a question by taking the derivative and looking for critical points? Of course not; that would be absurd.
In a similar vein, would you verify that $f$ is a legitimate density function by formally integrating it piece by piece? Or would you simply compute the rectangle areas, $1 \cdot (1/2 - 0) + 2 \cdot (3/4 - 1/2) = 1$?
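If you like, you can confirm both facts numerically. Here is a minimal Python sketch (the grid and step size are arbitrary choices of mine, purely for illustration):

```python
import numpy as np

# The piecewise density: f(x) = 1 on (0, 1/2], 2 on (1/2, 3/4], 0 otherwise.
def f(x):
    x = np.asarray(x, dtype=float)
    return np.where((x > 0) & (x <= 0.5), 1.0,
                    np.where((x > 0.5) & (x <= 0.75), 2.0, 0.0))

xs = np.linspace(-1.0, 2.0, 300_001)
dx = xs[1] - xs[0]
print(f(xs).max())         # 2.0 -- the maximum, found by inspection, not calculus
print((f(xs) * dx).sum())  # ~1.0 -- the total area, i.e., f is a valid density
```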
The tools of calculus are powerful, but they are not always the tools of first resort. You first have to think about what it is you are trying to achieve, and then choose the most appropriate mathematical method to achieve it. If you reflexively think "differentiation $\to$ critical points $\to$ extrema," you have effectively skipped this initial step.
Here is another example. Suppose a random variable $X$ has the following distribution: $$\Pr[X = 0] = \theta/6, \quad \Pr[X = 1] = \theta/3, \quad \Pr[X = 2] = 1 - \theta/2.$$ You can verify that for any $\theta \in [0,2]$, the above distribution is a legitimate probability mass function.
Now what if I said, "given a single observation from this distribution, what is the MLE of $\theta$?" How would you go about answering such a question?
Well, let's construct the likelihood given a single observation. $$\mathcal L(\theta \mid x) = f(x \mid \theta) = \ldots?$$ Wait, this is weird. What do you write? Well, the probability mass function is $$\Pr[X = x] = \begin{cases} \theta/6, & x = 0 \\ \theta/3, & x = 1 \\ 1 - \theta/2, & x = 2. \end{cases}$$ It's piecewise. So, the likelihood is also piecewise. What this tells you is that $\mathcal L$ is three different functions of $\theta$, depending on the observed outcome $x$. Now the key is to ask, "for each of the three outcomes, what allowable value of $\theta$ can you choose that makes $\mathcal L$ as big as possible?"
For example, suppose we observed $x = 0$. Then $\mathcal L(\theta \mid x = 0) = \theta/6$. You might say, "but I can make this value as big as I want!" Indeed, if you differentiate it with respect to $\theta$, you will find no critical points. But remember, $\theta$ cannot exceed $2$: the reason is that $\Pr[X = 2] = 1 - \theta/2$, and this value, being a probability, cannot be less than $0$. So, in the case where we observed $x = 0$, the maximum likelihood estimate is $\hat \theta = 2$.
If you do the same for the other cases, you will find that for a single observation $x$, $$\hat \theta = \begin{cases} 2, & x \in \{0, 1\} \\ 0, & x = 2. \end{cases}$$ There are no other possibilities.
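If you want to see this without trusting my algebra, a quick grid search in Python reproduces the same answers (the grid over $[0, 2]$ is just an illustrative choice):

```python
import numpy as np

thetas = np.linspace(0, 2, 2001)  # the allowable values of theta

# Likelihood of a single observation x, as a function of theta.
def likelihood(x, theta):
    return {0: theta / 6, 1: theta / 3, 2: 1 - theta / 2}[x]

for x in (0, 1, 2):
    L = likelihood(x, thetas)
    print(x, thetas[np.argmax(L)])  # prints: 0 2.0, 1 2.0, 2 0.0
```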
Note we didn't use calculus at all. We just used basic mathematical reasoning, founded upon a precise understanding of the question at hand.
When it comes to the MLE for a parameter $\theta$ where $U \sim \operatorname{Uniform}(0,\theta)$, the idea is the same: for an IID sample $\boldsymbol u = (u_1, \ldots, u_n)$, we construct the joint density $$f(\boldsymbol u \mid \theta) = \begin{cases} \theta^{-n}, & u_1, \ldots, u_n \in [0, \theta] \\ 0, & \text{otherwise}. \end{cases}$$ We can do better, though, because we note that in the sample $(u_1, \ldots, u_n)$, there must be a smallest element and a largest element, which we will call $u_{(1)}$ and $u_{(n)}$, respectively. Then we can write $$\mathcal L(\theta \mid \boldsymbol u) = f(\boldsymbol u \mid \theta) = \begin{cases} \theta^{-n}, & 0 \le u_{(1)} \le u_{(n)} \le \theta, \\ 0, & \text{otherwise}. \end{cases}$$ And now what? We see that if we regard the sample as being fixed, and $\theta$ the unknown variable in the likelihood, then the likelihood is largest when $\theta$ is as small as permissible. But since the sample is fixed, $\theta$ cannot be smaller than $u_{(n)}$, else the "otherwise" case "kicks in" and the likelihood is zero. But if $\theta$ is larger than $u_{(n)}$, then we are suboptimal, because we're allowing $\theta$ larger than it needs to be, making $\mathcal L$ smaller than it needs to be. Therefore, $$\hat \theta = u_{(n)} = \max(u_1, \ldots, u_n),$$ the maximum observation in the sample.
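Again, a short simulation makes the picture vivid (a sketch only; the true $\theta = 2$, the sample size, and the grid are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0, 2, size=20)   # IID sample with true theta = 2

def L(theta, u):
    # theta^(-n) if every observation lies in [0, theta], else 0.
    return theta ** (-len(u)) if u.max() <= theta else 0.0

thetas = np.linspace(0.01, 4, 4000)
Ls = np.array([L(t, u) for t in thetas])
print(u.max())                # the sample maximum u_(n)
print(thetas[np.argmax(Ls)])  # the grid maximizer: the smallest theta >= u_(n)
```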
Welcome back to MSE.
This is one of those things that makes sense once it is explained to you correctly the first time, without any gaps in the explanation. Unfortunately, in my experience, most answers, and even professors, don't explain all of the details.
Suppose $X_1, \dots, X_n$ are independent and distributed $\text{Uniform}(0, \theta)$, with $\theta > 0$.
Let $\mathbf{I}$ denote the indicator function, where
$$\mathbf{I}(\cdot) = \begin{cases}
1, & \cdot \text{ is true} \\
0, & \cdot \text{ is false.}
\end{cases}$$
The probability density function of any of the $X_i$, for $i \in \{1, \dots, n\}$, can be written like so:
$$f_{X_i}(x_i \mid \theta) = \dfrac{1}{\theta}\cdot\mathbf{I}(0<x_i<\theta)\text{.}$$
The likelihood function is thus given by
$$\begin{align}
L(\theta)&=f_{X_1, \dots, X_n}(x_1, \dots, x_n \mid \theta)\\
&=\prod_{i=1}^{n}f_{X_i}(x_i \mid \theta) \\
&= \dfrac{1}{\theta^n}\prod_{i=1}^{n}\mathbf{I}(0 < x_i < \theta)\text{.}
\end{align}$$
The following claim, although used implicitly in most derivations, is often omitted from explanations:
Claim. Let $A$ and $B$ be events. Then $\mathbf{I}(A)\cdot \mathbf{I}(B)=\mathbf{I}(A \cap B)$.
I leave the proof of this to you. Note that $ 0 < x_i < \theta$ is the same as requiring both $x_i > 0$ and $x_i < \theta$. Hence, we write
$$\begin{align}
L(\theta)&=\dfrac{1}{\theta^n}\prod_{i=1}^{n}\mathbf{I}(0 < x_i < \theta) \\
&= \dfrac{1}{\theta^n}\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)\mathbf{I}(x_i < \theta)] \\
&= \dfrac{1}{\theta^n}\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)]\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)]\text{.}
\end{align}$$
It will be clear why I split the product as above in a bit.
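Since I've left the proof of the claim to you, here is at least a brute-force spot-check in Python, covering all truth-value combinations (a check of the two-event case, not a proof for general events):

```python
# Verify I(A) * I(B) == I(A and B) for every combination of truth values.
for A in (True, False):
    for B in (True, False):
        assert int(A) * int(B) == int(A and B)
```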
The claim given above remains true if we extend it to an arbitrary number of events. Thus,
$$\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)] = \mathbf{I}(x_1 > 0 \cap x_2 > 0 \cap \cdots \cap x_n > 0)$$
and
$$\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)] = \mathbf{I}(x_1 < \theta \cap x_2 < \theta \cap \cdots \cap x_n < \theta)\text{.}$$
The next claims are often omitted as well from explanations:
Claim 1. Given $x_1, \dots, x_n \in \mathbb{R}$, $x_1, \dots, x_n < k$ if and only if $$x_{(n)}:=\max_{1 \leq i \leq n}x_i < k\text{.}$$
Claim 2. Given $x_1, \dots, x_n \in \mathbb{R}$, $x_1, \dots, x_n > k$ if and only if $$x_{(1)}:=\min_{1 \leq i \leq n}x_i > k\text{.}$$
Thus
$$\prod_{i=1}^{n}[\mathbf{I}(x_i > 0)] = \mathbf{I}(x_1 > 0 \cap x_2 > 0 \cap \cdots \cap x_n > 0) = \mathbf{I}(x_{(1)} > 0)$$
and
$$\prod_{j=1}^{n}[\mathbf{I}(x_j < \theta)] = \mathbf{I}(x_1 < \theta \cap x_2 < \theta \cap \cdots \cap x_n < \theta) = \mathbf{I}(x_{(n)} < \theta)\text{.}$$
The likelihood function is thus
$$L(\theta) = \dfrac{1}{\theta^n}\mathbf{I}(x_{(1)} > 0)\mathbf{I}(x_{(n)} < \theta)\text{.}\tag{*}$$
Now, consider the above as a function of $\theta$. For all intents and purposes, $\mathbf{I}(x_{(1)} > 0)$ is irrelevant when it comes to maximization of $L$ with respect to $\theta$, because it is independent of $\theta$. So, the part that really matters is
$$L(\theta) \propto \dfrac{1}{\theta^n}\mathbf{I}(x_{(n)} < \theta) = \dfrac{1}{\theta^n}\mathbf{I}(\theta > x_{(n)})\text{.}\tag{**}$$
Generally, when doing maximum-likelihood estimation, we assume that the observed $x_i$ fall within the support of the given distribution, so we'll just assume $x_{(1)} > 0$.
Remember to view (**) as a function of $\theta$. If $\theta \leq x_{(n)}$, note that $L(\theta) = 0$ because of the indicator function. This cannot be where $L$ is maximized: $L$ is built from values of a probability density function, so it is nonnegative, and $0$ is the smallest value it can possibly take, while $L$ is strictly positive for $\theta > x_{(n)}$.
So, in attempting to maximize $L$, suppose that $\theta > x_{(n)}$. For $n$ fixed, we obtain
$$L(\theta) \propto\dfrac{1}{\theta^n}\text{.}$$
Now, note that $\dfrac{1}{\theta^n}$ is indeed a decreasing function of $\theta$ with $n$ fixed. Thus, we must make $\theta$ as small as possible, given our restriction of $\theta > x_{(n)}$.
Note. Technically, no smallest such $\theta$ exists (because $\theta$ must be strictly greater than $x_{(n)}$ per our assumptions), so the infimum is not attained. This is often ignored in many textbooks.
Most textbooks will then say that the maximum likelihood estimator of $\theta$ is
$$\hat{\theta}_{\text{MLE}} = X_{(n)}\text{.}$$
Note. Technically, the above result is false: the MLE does not exist, because $\theta$ cannot take on the value $x_{(n)}$ itself. For this answer to be correct, the support of the uniform PDF must include $\theta$ itself (because the maximum likelihood estimator equals one of the $X_i$). The reason for this is discussed in Lecture 2: Maximum Likelihood Estimators from MIT OpenCourseWare 18-443 Statistics for Applications. As the question currently stands, $(0, \theta)$ should be $(0, \theta]$.
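To make the existence subtlety concrete, here is a tiny numerical illustration of $(*)$ with the strict inequality (a sketch; the simulated sample is my own choice):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=30)   # simulated sample; pretend theta is unknown
n, xmax = len(x), x.max()

def L(theta):
    # Likelihood (*), with the strict inequality x_(n) < theta.
    return theta ** (-n) * float(x.min() > 0) * float(xmax < theta)

print(L(xmax))                    # 0.0 -- theta = x_(n) itself is excluded
for eps in (1e-1, 1e-3, 1e-6):
    print(L(xmax + eps))          # positive, and increasing as eps shrinks
```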
Best Answer
One important thing that you are probably missing is that $f(x;\alpha,\beta)=0$ when $x\notin [\alpha,\alpha+\beta]$. So, for $L$ to take a positive value for your choice of $\alpha,\beta$, you need $\alpha\leq X_{\min}\leq X_{\max}\leq \alpha+\beta$. I'm pretty sure you then saw that $L$ is strictly decreasing in $\beta$. This, together with the restriction that $\beta\geq X_{\max}-\alpha$, should lead you to conclude that optimally $\beta=X_{\max}-\alpha$ (rather than $\beta=X_{\max}-X_{\min}$, which you mention).
You then need to substitute this value into your expression for $L$ and then proceed to maximise the resulting expression with respect to $\alpha$.
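A rough numerical version of that two-step procedure, under the assumption of a simulated sample (a sketch of the idea, not your original setup):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(3, 7, size=50)   # true alpha = 3, beta = 4
xmin, xmax, n = x.min(), x.max(), len(x)

# Step 1: for fixed alpha, L = beta^(-n) decreases in beta, subject to
# beta >= xmax - alpha, so the optimal choice is beta = xmax - alpha.
# Step 2: substitute, then maximize (xmax - alpha)^(-n) over alpha <= xmin.
alphas = np.linspace(0.0, xmin, 5001)
profile = (xmax - alphas) ** (-n)
alpha_hat = alphas[np.argmax(profile)]
print(alpha_hat, xmin)                # the profile increases in alpha: alpha_hat = xmin
print(xmax - alpha_hat, xmax - xmin)  # hence beta_hat = xmax - xmin in the end
```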