I've simply included the diagrams (here they're larger than on the webpage you linked us to!), along with the caption below the figure: "Key concepts".
Together with a thorough reading of the "Correctness" (and proof of the algorithm) section of the exercise, you should hopefully be able to answer your question. If not, feel free to edit your post and specify, in much greater detail than you first provided, exactly what you're confused about. That way we'll be in a much better position to answer your question(s) satisfactorily.
Figure 33.11: Key concepts in the proof that the closest-pair algorithm needs to check only $7$ points following each point in the array $Y'$.
(a) If $p_L \in P_L$ and $p_R \in P_R$ are less than $\delta$ units apart, they must both reside within a $\delta \times 2\delta$ rectangle centered at line $\ell$.
(b) How $4$ points that are pairwise at least $\delta$ units apart can all reside within a $\delta \times \delta$ square. On the left are $4$ points in $P_L$, and on the right are $4$ points in $P_R$. There can be $8$ points in the $\delta \times 2\delta$ rectangle if the points shown on line $\ell$ are actually pairs of coincident points with one point in $P_L$ and one in $P_R$.
This last point may be the key to understanding why there can be no more than $8$ points in a $\delta \times 2\delta$ rectangle: a $\delta \times \delta$ square can contain at most $4$ points that are pairwise at least $\delta$ units apart.
Why? Try dividing a $\delta \times \delta$ square into $4$ squares (each of dimension $\delta/2 \times \delta/2$). The maximum distance between any two points within a given sub-square is the length of its diagonal, $\sqrt{2}\,\delta/2$, which is less than $\delta$. Hence each sub-square can contain at most one of the points, and thus at most $4$ points can be contained in the $\delta \times \delta$ square.
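As a quick numerical sanity check of both facts (a sketch in Python; the value of `delta` is just illustrative):

```python
import itertools
import math

delta = 1.0

# The four corners of a delta x delta square are pairwise at least delta
# apart, so the bound of 4 points is actually attained.
corners = [(0.0, 0.0), (0.0, delta), (delta, 0.0), (delta, delta)]
assert all(math.dist(p, q) >= delta
           for p, q in itertools.combinations(corners, 2))

# Each delta/2 x delta/2 sub-square has diagonal sqrt(2)*delta/2 < delta,
# so two points at distance >= delta can never share a sub-square.
assert math.sqrt(2) * delta / 2 < delta
```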
Start with:
$$2^{3.1} = 2^3 2^{0.1} = 2^3 e^{0.1 \log{2}}$$
Now use a Taylor expansion, so that the above is approximately
$$2^3 \left [1+0.1 \log{2} + \frac{1}{2!} (0.1 \log{2})^2 + \frac{1}{3!} (0.1 \log{2})^3+\cdots + \frac{1}{n!} (0.1 \log{2})^n\right ] $$
where $n$ depends on the tolerance you require. In this case, if the error tolerance is $\epsilon$, then we want
$$\frac{1}{(n+1)!} (0.1 \log{2})^{n+1} \lt \epsilon$$
For example, if $\epsilon=5 \cdot 10^{-7}$, then $n=4$.
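A quick check of this in Python, truncating the series at $n = 4$:

```python
import math

# Approximate 2^3.1 = 2^3 * e^{0.1 ln 2} by a degree-4 Taylor polynomial of e^x.
x = 0.1 * math.log(2)
n = 4
approx = 2**3 * sum(x**k / math.factorial(k) for k in range(n + 1))

# The truncation error is indeed below the stated tolerance of 5e-7.
assert abs(approx - 2**3.1) < 5e-7
```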
Best Answer
First, consider the simplified problem: find a black point $(x_b, y_b)$ and a white point $(x_w, y_w)$ such that (1) $x_w \ge x_b$ and (2) $y_w \ge y_b$.
That is, we only consider pairs of points such that the white point lies to the top right of the black point. Because of properties (1) and (2), the Manhattan distance between the two points becomes
$$|x_b - x_w| + |y_b - y_w| = x_w - x_b + y_w - y_b = (x_w + y_w) - (x_b + y_b)$$
This problem can be solved by sorting the points and sweeping them from left to right. Details are at the bottom of this explanation.
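To see the identity concretely (a toy example; the specific coordinates are made up):

```python
# A black point and a white point to its upper right, so that
# conditions (1) x_w >= x_b and (2) y_w >= y_b both hold.
xb, yb = 1, 2
xw, yw = 4, 5

manhattan = abs(xb - xw) + abs(yb - yw)
assert manhattan == (xw + yw) - (xb + yb)  # both sides equal 6
```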
Thus, in order to find the closest pair of black and white points, we solve the simplified problem $4$ times, once for each of the four possible orientations of the white point relative to the black point (the remaining three orientations are obtained by reflecting all points across the $x$-axis, the $y$-axis, or both).
Sweeping details
Sort the points by their $x$ coordinates (if they are tied, a black point is sorted before a white point). With each point $(x, y)$ we associate the value $x + y$. Now initialize a key-value data structure, and for each white point $(x, y)$, store the pair $(y, x + y)$ in this data structure (here $y$ is the key, and $x + y$ is the value).
Now, iterate over the list of sorted points, sweeping from left to right. Whenever we encounter a white point $(x, y)$, we remove its entry $(y, x + y)$ from the data structure. Whenever we encounter a black point $(x_b, y_b)$, the $x$ coordinates of all white points remaining in the data structure are at least $x_b$, so condition (1) holds. To also satisfy condition (2), we need $y_w \ge y_b$. Since we want to minimize the Manhattan distance, i.e., $(x_w + y_w) - (x_b + y_b)$, we want to minimize $x_w + y_w$. This corresponds exactly to querying our key-value data structure for the minimum value whose key is at least $y_b$.
If we use a data structure that supports the above operations in $O(\log n)$ time, we get an $O(n \log n)$ algorithm. An example of such a data structure is a balanced binary search tree.
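Here is one possible sketch of the whole algorithm in Python. To keep it self-contained, instead of a balanced BST with deletions it sweeps by *decreasing* $x$ and only ever inserts white points (which is equivalent to the left-to-right sweep with deletions), using a segment tree over the compressed $y$ coordinates as the key-value structure. All names here are my own, not from the original answer.

```python
import math

def closest_simplified(blacks, whites):
    """Min Manhattan distance over pairs with x_w >= x_b and y_w >= y_b."""
    INF = math.inf
    ys = sorted({y for _, y in blacks} | {y for _, y in whites})
    idx = {y: i for i, y in enumerate(ys)}      # y coordinate -> leaf position
    size = 1
    while size < len(ys):
        size *= 2
    tree = [INF] * (2 * size)                   # segment tree of minima

    def insert(i, val):                         # point-update: keep the minimum
        i += size
        tree[i] = min(tree[i], val)
        i //= 2
        while i:
            tree[i] = min(tree[2 * i], tree[2 * i + 1])
            i //= 2

    def suffix_min(lo):                         # min over leaves [lo, size)
        res, l, r = INF, lo + size, 2 * size
        while l < r:
            if l & 1:
                res = min(res, tree[l]); l += 1
            if r & 1:
                r -= 1; res = min(res, tree[r])
            l //= 2; r //= 2
        return res

    # Sweep by decreasing x; on ties, process white points first so that
    # a white point with x_w == x_b is already available to the black point.
    events = [(x, 1, y) for x, y in whites] + [(x, 0, y) for x, y in blacks]
    best = INF
    for x, is_white, y in sorted(events, reverse=True):
        if is_white:
            insert(idx[y], x + y)               # store (key y, value x + y)
        else:
            # condition (2): only white points with key y_w >= y_b qualify
            best = min(best, suffix_min(idx[y]) - (x + y))
    return best

def closest_pair(blacks, whites):
    """Closest black-white pair: solve the simplified problem 4 times,
    once per orientation, by reflecting the points across the axes."""
    return min(closest_simplified([(sx * x, sy * y) for x, y in blacks],
                                  [(sx * x, sy * y) for x, y in whites])
               for sx in (1, -1) for sy in (1, -1))
```

For example, `closest_pair([(0, 0), (5, 5)], [(1, 2), (6, 4)])` returns `2`, the Manhattan distance between $(5, 5)$ and $(6, 4)$.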