Comparing the question with the actual proof in the referenced book, some subtle but important aspects have been left out of the former:
1) This part of the proof is about the existence of a solution to the likelihood equation $\frac{\partial l(\theta)}{\partial \theta}=0$ that converges to the true parameter, not about "consistency of the MLE estimator".
2) The probability of $S_n$ tends to $1$. Then, by necessity, a $\hat \theta \in (\theta_0 -a, \theta_0 +a)$ will exist for the $\mathbf X$ that form the elements of $S_n$.
Then the proof states that as a consequence,
$$S_n \subset \{ \mathbf{X}: | \hat{ \theta_{n} } \left( \mathbf{X} \right) -\theta_{0} | < a \} \cap \{ \mathbf{X}: l^{ \prime} \left( \hat{\theta_n} \left( \mathbf{X} \right) \right) =0 \}$$
What does the RHS-set intersection describe?
It describes the set of samples $\mathbf{X}$ for which a) $\hat \theta$ is a solution to the likelihood equation (2nd set) and
b) $\hat \theta$ is less than $a$ away from the true parameter $\theta_0$ (1st set). The displayed relation asserts that $S_n$ is a subset of this intersection.
And indeed it is, since $\hat \theta$ may satisfy the two conditions (being a solution to the likelihood equation and lying less than $a$ away from the true parameter) for a set of samples larger than the one that forms $S_n$, which is instead characterized by a condition on the value of the likelihood at the true parameter (a condition unrelated to $\hat \theta$).
The proof then goes on to show that these facts imply that $\hat \theta$ lies less than $a$ away from the true parameter with probability tending to $1$, and then that, if $\hat \theta$ is unique and so coincides with the MLE, the latter is consistent.
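To make the $P(S_n)\to 1$ step concrete, here is a minimal Monte Carlo sketch in Python (not from the book; the Cauchy location model, the value of $a$, and all other numbers are illustrative assumptions). It estimates, for several sample sizes, the probability of the event that the log-likelihood at $\theta_0$ exceeds its values at $\theta_0 \pm a$, which is the condition characterizing $S_n$ alluded to above.

```python
# A minimal simulation sketch (illustrative assumptions: Cauchy location model,
# theta_0 = 0, a = 0.5): estimate P(S_n), the probability that the log-likelihood
# at theta_0 exceeds its values at both theta_0 - a and theta_0 + a.
import numpy as np

rng = np.random.default_rng(0)
theta0, a = 0.0, 0.5
n_rep = 2000                      # Monte Carlo replications per sample size

def loglik(theta, x):
    """Cauchy location log-likelihood, up to an additive constant."""
    return -np.sum(np.log1p((x - theta) ** 2))

for n in [5, 20, 100, 500]:
    in_Sn = 0
    for _ in range(n_rep):
        x = theta0 + rng.standard_cauchy(n)   # sample of size n from Cauchy(theta_0)
        if loglik(theta0, x) > max(loglik(theta0 - a, x), loglik(theta0 + a, x)):
            in_Sn += 1                        # this X belongs to S_n
    print(f"n = {n:4d}:  estimated P(S_n) = {in_Sn / n_rep:.3f}")
```

The estimated probabilities climb toward $1$ as $n$ grows, which is exactly what makes the existence of a root of $l'(\theta)=0$ inside $(\theta_0-a,\theta_0+a)$ increasingly likely.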
In a reply concerning the interpretation of probability densities I have argued for the merits of including the differential terms ($dx$ and $dy$ in this case) in the PDF, so let's write
$$f(x,y; \theta,\lambda) = \frac{1}{\lambda x \sqrt{2\pi}} \exp\left(-\frac{(y-x\theta)^2}{2x^2} - \frac{x}{\lambda}\right)\,dx\,dy.$$
In (d), attention is focused on $\hat{\theta}$, which appears to be a multiple of $\sum_i y_i/x_i$ where each $(x_i,y_i)$ is an independent realization of the bivariate random variable described by $f$. To tackle this, let's consider the distribution of $Z=Y/X$; we will then have to sum a sequence of independent versions of this variable in order to determine the distribution of $\hat\theta$.
In contemplating the change of variable $(X,Y)\to (X,Z) = (X,Y/X)$, we recognize that the positivity of $X$ assures that this induces an order-preserving, one-to-one correspondence between $Y$ and $Z$ for each $X$. Therefore all we need to do is change the variable in the integrand, writing $y=x z$:
$$f(x,x z; \theta,\lambda) = \frac{1}{\lambda x \sqrt{2\pi}} \exp\left(-\frac{(x z-x\theta)^2}{2x^2} - \frac{x}{\lambda}\right)\,dx\,d(x z).$$
From the product rule $d(x z) = z dx + x dz$, and understanding the combination $dx\, d(x z)$ in the sense of differential forms $dx \wedge d(x z)$, we mechanically obtain
$$dx\, d(x z) = dx \wedge d(x z) = dx \wedge (z dx + x dz) = z dx \wedge dx + x dx \wedge dz = x\,dx\,dz.$$
(This is the easy way to obtain the Jacobian determinant of the transformation.)
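(As a side check, not needed for the argument: the same factor can be confirmed symbolically. The snippet below is a small sketch using sympy that differentiates the map $(x,z)\mapsto(x, xz)$ and prints its Jacobian determinant.)

```python
# A small symbolic cross-check of the wedge-product computation above:
# the Jacobian determinant of (x, z) |-> (x, y) = (x, x*z) is x.
import sympy as sp

x, z = sp.symbols('x z', positive=True)
y = x * z
J = sp.Matrix([[sp.diff(x, x), sp.diff(x, z)],
               [sp.diff(y, x), sp.diff(y, z)]])
print(J.det())   # prints x, matching dx d(xz) = x dx dz
```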
Therefore the distribution of $(X,Z)$ has PDF
$$\eqalign{
g(x,z;\theta,\lambda) = f(x, x z; \theta, \lambda) &= \frac{1}{\lambda x \sqrt{2\pi}} \exp\left(-\frac{(x z-x\theta)^2}{2x^2} - \frac{x}{\lambda}\right)\,x dx\,dz \\
&= \frac{1}{\lambda} \exp\left( - \frac{x}{\lambda}\right)\frac{1}{\sqrt{2\pi}}\,dx\ \exp\left(-\frac{( z-\theta)^2}{2}\right) \,dz.
}$$
Because this has separated into a product of a PDF for $X$ and a PDF for $Z$, we see without any further effort that (1) $X$ and $Z$ are independent and (2) $Z$ has a Normal$(\theta,1)$ distribution.
Finding the joint distribution of $(\hat{\lambda}, \hat\theta)$, which are easily expressed in terms of $X$ and $Z$, is now straightforward.
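For instance, a quick Monte Carlo sketch of this factorization (the parameter values below are illustrative assumptions) draws $X$ from an Exponential distribution with mean $\lambda$ and $Y \mid X = x$ from Normal$(x\theta, x^2)$, exactly as $f$ prescribes, and then checks that $Z = Y/X$ looks like a Normal$(\theta, 1)$ variable uncorrelated with $X$.

```python
# A minimal simulation sketch of the conclusion above; theta = 2 and lambda = 3
# are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(1)
theta, lam, n = 2.0, 3.0, 200_000

x = rng.exponential(scale=lam, size=n)        # X ~ Exponential with mean lambda
y = rng.normal(loc=x * theta, scale=x)        # Y | X = x ~ Normal(x*theta, x^2)
z = y / x

print("mean of Z:", z.mean())                 # should be close to theta
print("variance of Z:", z.var())              # should be close to 1
print("corr(X, Z):", np.corrcoef(x, z)[0, 1]) # should be close to 0
```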
In this problem your unknown parameter $\theta$ only has two possible values, so you have a discrete optimisation where you just have to compare the likelihood at those two parameter values. (If you are taking derivatives of something in a discrete optimisation then you are going down the wrong track.) For an observed data vector $\mathbf{x}$ you have:
$$L_\mathbf{x}(\theta) = \begin{cases} (2 \pi)^{-n/2} \exp(-\tfrac{1}{2} \sum x_i^2) & & \text{for } \theta = 1, \\[6pt] \pi^{-n} / \prod (1+x_i^2) & & \text{for } \theta = 2. \\ \end{cases}$$
Since there are only two possible parameter values, you can find the maximising parameter value by looking at the sign of the difference in likelihood at these values. You have:
$$\begin{aligned} \Delta(\mathbf{x}) \equiv \text{sgn}(L_\mathbf{x}(1)-L_\mathbf{x}(2)) &= \text{sgn}\Bigg( (2 \pi)^{-n/2} \exp(-\tfrac{1}{2} \sum x_i^2) - \frac{1}{\pi^{n} \prod (1+x_i^2)} \Bigg) \\[6pt] &= \text{sgn}\Bigg( \exp(-\tfrac{1}{2} \sum x_i^2)\prod (1+x_i^2) - \Bigg( \frac{2}{\pi} \Bigg)^{n/2} \Bigg). \end{aligned}$$
The maximum likelihood estimator (MLE) is:
$$\hat{\theta} = \begin{cases} 2 & & \text{if } \Delta(\mathbf{x}) = -1, \\[6pt] \{ 1,2 \} & & \text{if } \Delta(\mathbf{x}) = 0, \\[6pt] 1 & & \text{if } \Delta(\mathbf{x}) = 1. \\[6pt] \end{cases}$$
(In the case where $\Delta(\mathbf{x}) = 0$ the MLE is non-unique since the likelihood is the same at both parameter values.)
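As a sanity check, here is a short Python sketch (the function name `mle_theta` and the example data are mine, not part of the question): it evaluates the two log-likelihoods and returns the maximising set of parameter values. Comparing log-likelihoods is equivalent to examining the sign of $\Delta(\mathbf{x})$ and avoids numerical underflow for larger $n$.

```python
# A short sketch of the decision rule above: compare the Normal(0,1) log-likelihood
# (theta = 1) with the standard Cauchy log-likelihood (theta = 2).
import numpy as np

def mle_theta(x):
    x = np.asarray(x, dtype=float)
    n = x.size
    loglik_1 = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(x ** 2)   # theta = 1 (Normal)
    loglik_2 = -n * np.log(np.pi) - np.sum(np.log1p(x ** 2))         # theta = 2 (Cauchy)
    if loglik_1 > loglik_2:
        return {1}
    if loglik_1 < loglik_2:
        return {2}
    return {1, 2}   # non-unique MLE when the likelihoods coincide

print(mle_theta([0.1, -0.3, 0.2]))   # {1}: light-tailed data favours the Normal
print(mle_theta([5.0, -7.0, 12.0]))  # {2}: heavy-tailed data favours the Cauchy
```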