Your approach is correct.
By differentiating with respect to $\beta_2$, we can see that at the optimal value, we must have
$$\hat{\beta}_2 = y_n - \hat{\beta}_1 x_n - \hat{\beta}_0$$
That is, the last term of the objective function must vanish at the optimum.
Hence the problem to solve for $\hat{\beta_0}$ and $\hat{\beta_1}$ is the same as minimizing
$$\sum_{i=1}^{n-1} (y_i-\beta_0-\beta_1 x_i)^2$$
Hence, we know that $\hat{\beta}_1=\hat{\alpha}_1$ and, furthermore, $\hat{\beta}_0=\hat{\alpha}_0$.
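If the $\beta_2$ term enters the model only through the $n$-th observation (i.e., as a dummy variable for $i=n$, which is how I read the objective), a quick R check on simulated data illustrates the equivalence:

set.seed(1); n = 20
x = rnorm(n); y = 1 + 2*x + rnorm(n)
d = as.numeric(seq_len(n) == n)   # dummy: 1 only for the n-th observation
coef(lm(y ~ x + d))               # beta0-hat, beta1-hat, beta2-hat
coef(lm(y[-n] ~ x[-n]))           # alpha0-hat, alpha1-hat: same intercept and slope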
Let $\lambda_i = \lambda,$ for $i = 1,2,3.$
In this special case, $E(\max_i X_i)$ for $X_i \stackrel{iid}{\sim} \mathsf{Exp}(\text{rate} = \lambda)$ can be found as follows:
Consider the $X_i$ to be times to failure of three devices. The time of the first failure is $\min_i(X_i) = X_{(1)} \sim \mathsf{Exp}(3\lambda),$ with $E(X_{(1)}) = 1/(3\lambda).$
Then, by the no-memory property, the additional time until the second failure is
$D_2 = X_{(2)}-X_{(1)} \sim \mathsf{Exp}(2\lambda),$ with $E(D_2) = 1/(2\lambda).$ This is the expected minimum time to failure of the remaining two devices.
Similarly, the additional time to failure $D_3$ of the (single remaining) third device has $E(D_3) = 1/\lambda.$
Thus the total expected time until the third and last failure is $E(\max_i X_i) = E(X_{(3)}) = 1/(3\lambda) + 1/(2\lambda) + 1/\lambda = 11/(6\lambda).$
This method cannot be used for the general case in which the rates are unequal because we don't know which devices will fail first and second.
However, with the condition that $X_1 < X_2 < X_3,$ we do know the order of failure, so the conditional expected time to failure can be found as in the solution attached to the question.
Simulation in R for max and min with $\lambda=2:$
set.seed(728); m=10^6; lam = 2
x1 = rexp(m,lam); x2 = rexp(m,lam); x3 = rexp(m,lam)
v = pmin(x1, x2, x3)
mean(v)
[1] 0.1664693 # approx E(min) = 1/6
w = pmax(x1, x2, x3)
mean(w)
[1] 0.9167773 # approx E(max) = 11/12
1/(3*lam) + 1/(2*lam) + 1/lam
[1] 0.9166667 # 11/12
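As a further check of the spacings argument, reusing the simulated x1, x2, x3 above, the three successive gaps between order statistics should average about $1/(3\lambda),$ $1/(2\lambda),$ and $1/\lambda$ (here 1/6, 1/4, and 1/2):

mid = x1 + x2 + x3 - v - w               # middle order statistic X_(2)
c(mean(v), mean(mid - v), mean(w - mid))
# theory: 1/(3*lam) = 1/6, 1/(2*lam) = 1/4, 1/lam = 1/2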
To make things easier, I will use $X,Y,Z$ in place of $X_1,X_2,X_3$, and assume the intercept is 0 (along with some other assumptions). The ideas should extend to more general cases.
We are given regressions:
(1) $X=aY+U$, with residual $U$.
(2) $U = X-aY = bZ+V$, with residual $V$.
(3) $X = cY+dZ+W$, with residual $W$.
We'd like to show $|b|\le|d|$.
Rewrite (2) as:
(4) $X = aY+bZ+V$
Compare (4) and (3): for a reasonable (least-squares) regression, $(c,d)$ minimizes the residual variance over all coefficient pairs, and $(a,b)$ is one such pair, so we have
(5) $Var(W)\le Var(V)$.
Similarly, compare (1) and (3): since $a$ minimizes $Var(X-\alpha Y)$ over $\alpha$, we have $Var(U) \le Var(X-cY) = Var(dZ+W)$; replacing $U$ with $bZ+V$ from (2) gives
(6) $Var(bZ+V)\le Var(dZ+W)$.
Since for a reasonable regression we have $Cov(Z,V)=Cov(Z,W)=0$ (otherwise there would be correlation left unaccounted for by the coefficients), we can expand the variances in (6) to deduce:
(7) $b^2Var(Z)+Var(V) \le d^2 Var(Z)+Var(W)$
Combining with (5), $b^2 Var(Z) \le d^2 Var(Z) + \big(Var(W)-Var(V)\big) \le d^2 Var(Z),$ and hence $|b| \le |d|,$ as desired.
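A quick numerical illustration in R (simulated data with arbitrarily chosen coefficients; all regressions fit without intercepts, matching the assumption above):

set.seed(42); n = 1000
y = rnorm(n); z = 0.5*y + rnorm(n)
x = 1.0*y + 0.7*z + rnorm(n)
u = resid(lm(x ~ y - 1))              # (1): regress X on Y, keep residual U
b = coef(lm(u ~ z - 1))[["z"]]        # (2): regress U on Z
d = coef(lm(x ~ y + z - 1))[["z"]]    # (3): regress X on Y and Z jointly
c(b = b, d = d)                       # |b| <= |d| should hold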