Explanation based on DeGroot, second edition, page 256. Consider the binomial distribution with fixed $p$
$$
P(X = k) = {n \choose k}p^k(1-p)^{n-k}
$$
Now define $\lambda = np$ and thus $p = \frac{\lambda}{n}$.
$$
\begin{align}
P(X = k) &= {n \choose k}p^k(1-p)^{n-k}\\
&=\frac{n(n-1)(n-2)\cdots(n-k+1)}{k!}\frac{\lambda^k}{n^k}\left(1-\frac{\lambda}{n}\right)^{n-k}\\
&=\frac{\lambda^k}{k!}\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\left(1-\frac{\lambda}{n}\right)^n\left(1-\frac{\lambda}{n}\right)^{-k}
\end{align}
$$
Let $n \to \infty$ and $p \to 0$ so $np$ remains constant and equal to $\lambda$.
Now
$$
\lim_{n \to \infty}\frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\left(1-\frac{\lambda}{n}\right)^{-k} = 1
$$
since each fraction $\frac{n-j}{n}$ tends to $1$ (the numerator and denominator grow at the same rate), and in the last factor $\frac{\lambda}{n} \to 0$, so it tends to $1$ as well. Furthermore
$$
\lim_{n \to \infty}\left(1-\frac{\lambda}{n}\right)^n = e^{-\lambda}
$$
so under our definitions
$$
\lim_{n \to \infty} {n \choose k}p^k(1-p)^{n-k} = \frac{\lambda^k}{k!}e^{-\lambda}
$$
In other words, as the probability of success becomes a rate applied to a continuum, as opposed to discrete selections, the binomial becomes the Poisson.
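A quick numerical sanity check of this limit (a sketch with arbitrarily chosen $\lambda$ and $k$): as $n$ grows with $p = \lambda/n$, the binomial pmf approaches the Poisson pmf.

```python
from math import comb, exp, factorial

# Compare Binomial(n, p = lam/n) with Poisson(lam) at a fixed k
# as n grows. The values lam = 3.0 and k = 4 are illustrative.
lam, k = 3.0, 4

def binom_pmf(n, k, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k / factorial(k) * exp(-lam)

for n in (10, 100, 10_000):
    print(n, binom_pmf(n, k, lam / n))
print("Poisson:", poisson_pmf(k, lam))
```

Already at $n = 10{,}000$ the two pmfs agree to several decimal places.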
Update with key point from comments
Think about a Poisson process. It really is, in a sense, looking at very, very small intervals of time and seeing whether something happened. The intervals must be "very, very small" so that at most one event can occur in each. So what we have is essentially an infinite sum of Bernoulli trials, each with an infinitesimal success probability. A finite sum of Bernoulli trials with fixed probability is binomial; when the number of trials becomes infinite but the expected count $np = \lambda$ stays finite, it is Poisson.
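This picture can be simulated directly (a minimal sketch; the subdivision count `n` and trial count are arbitrary): chop a unit of time into many tiny intervals, flip an independent Bernoulli($\lambda/n$) coin in each, and the total count behaves like a Poisson($\lambda$) variable.

```python
import random

# Approximate a Poisson(lam) count over one unit of time by
# subdividing it into n tiny Bernoulli(lam/n) intervals.
random.seed(0)
lam, n, trials = 3.0, 1_000, 2_000

def approx_poisson(lam, n):
    p = lam / n
    # Count the intervals in which an "event" occurred.
    return sum(1 for _ in range(n) if random.random() < p)

mean = sum(approx_poisson(lam, n) for _ in range(trials)) / trials
print(mean)  # empirical mean, close to lam = 3
```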
Just giving an intuitive overview. Michael Zubilevsky's lectures go through it in more technical detail.
Consider the quadratic objective $f(x) = x^T Q x + b^T x$. Geometrically, CG tries to "unwarp" some of the geometry induced by the quadratic form $Q$.
To build intuition, consider $Q \in \mathbb{R}^{2\times 2}$. If $Q = I$, then the contours of $f$ are circles. If $Q$ is diagonal, then the contours of $f$ are $x$-axis and $y$-axis aligned ellipses. If $Q$ has off-diagonal components as well, then there is "correlation" between the $x$ and $y$ directions, so the contours are ellipses whose principal axes are a combination of the $x$- and $y$-axis.
Now consider performing coordinate descent in the $Q=I$ or $Q$ diagonal case. We can reach the optimum in at most $2$ steps by viewing the problem as two $1$-dimensional minimization problems over each coordinate. We have this property because the $x$- and $y$-axis are orthogonal, and they "align" with the geometry of $Q$.
In the case where $Q$ contains off-diagonal components, CG finds directions that are orthogonal in $Q$'s geometry (i.e. $Q$-conjugate) so that we can still perform a form of coordinate descent. We can use Gram-Schmidt to construct these conjugate directions, but if we start by orthogonalizing against the gradient, the update rule simplifies considerably.
CG outperforms gradient descent because the conjugacy of the search directions guarantees we never revisit a direction we have already minimized over. Thus we get rid of the "zig-zagging" of GD.
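A minimal CG sketch for a $2\times 2$ system illustrates the two-step convergence. Note it uses the common normalization $f(x) = \tfrac{1}{2}x^T Q x - c^T x$ (which differs from the $f$ above by a sign and scale on the linear term); the particular $Q$ and $c$ are arbitrary.

```python
import numpy as np

# Conjugate gradient for f(x) = 0.5 x^T Q x - c^T x with SPD Q.
# For a 2x2 Q, CG reaches the minimizer in at most 2 steps.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])  # off-diagonal "correlation"
c = np.array([1.0, 2.0])

x = np.zeros(2)
r = c - Q @ x        # residual = negative gradient
d = r.copy()         # first direction: steepest descent
for _ in range(2):   # at most n = 2 iterations
    alpha = (r @ r) / (d @ Q @ d)     # exact line search along d
    x = x + alpha * d
    r_new = r - alpha * (Q @ d)
    beta = (r_new @ r_new) / (r @ r)  # orthogonalize against previous d
    d = r_new + beta * d              # next direction is Q-conjugate to d
    r = r_new

print(x, np.linalg.solve(Q, c))  # the two should match
```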
[EDIT: some additional notes here]
Best Answer
Approach #1: density
First, note that $AB$ and $BA$ are conjugate if either $A$ or $B$ is invertible, and conjugate matrices have the same trace (e.g. because the trace is the sum of the eigenvalues). But the subspace of matrices $(A, B)$ such that either $A$ or $B$ is invertible is dense in the space of all matrices (this is true with respect to various topologies; the usual topology over $\mathbb{R}$ or $\mathbb{C}$, and the Zariski topology over any field), which means the identity must hold in general.
See also this answer for other applications of this idea.
This approach may be a little circular because, depending on your definitions, the easiest way to prove that the trace of two conjugate matrices is the same might be to prove that $\text{tr}(AB) = \text{tr}(BA)$. Here is a definition which does not have this problem: if $V$ is a finite-dimensional vector space, then $\text{End}(V)$ can be naturally identified with the tensor product $V^{\ast} \otimes V$, and then the trace of an endomorphism is given by applying the dual pairing $V^{\ast} \otimes V \to k$ ($k$ the underlying field).
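A numerical spot-check of the identity (an illustrative sketch with random matrices), deliberately making $A$ singular to emphasize that the identity holds beyond the dense invertible case:

```python
import numpy as np

# Sanity check that tr(AB) = tr(BA), even when A is singular;
# the density argument extends the conjugacy case to such A.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
A[:, 0] = 0.0  # force A to be singular (zero column)
print(np.trace(A @ B), np.trace(B @ A))  # equal
```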
Approach #2: determinants
One way to define the trace is that it's (up to sign) the coefficient of $t^{n-1}$ in the characteristic polynomial $\det (tI - A)$. This is a nice definition for several reasons, including that it's manifestly invariant under conjugation (once you know that determinants are invariant under conjugation). You can then prove, in various ways, the stronger fact that $AB$ and $BA$ have the same characteristic polynomial, which implies that they have the same trace (and is equivalent to the claim that they have the same eigenvalues, with the same multiplicities). See this blog post for proofs and discussion.
This approach suggests that one intuition for the trace is that it is the derivative of the determinant, suitably interpreted.
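The stronger fact is easy to check numerically (a sketch with random matrices; `np.poly` returns the characteristic polynomial's coefficients):

```python
import numpy as np

# AB and BA share a characteristic polynomial,
# hence the same trace and the same eigenvalues.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
p_AB = np.poly(A @ B)  # coefficients of det(tI - AB)
p_BA = np.poly(B @ A)  # coefficients of det(tI - BA)
print(p_AB)
print(p_BA)  # same coefficients
```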
Approach #3: string diagrams
This proof involves the most category theory but it is also arguably the most beautiful. The intuitive idea is to think of taking the trace as "feeding a linear operator back into itself," so it looks like placing a linear operator in a circle in some sense, and then cyclicity follows from rotating the circle. This argument can be made fully rigorous; see this blog post and this blog post for the relevant pictures. After setting up appropriate notation and proving various lemmas, the proof of the cyclicity of the trace reduces to rotating a single diagram; the linked posts show the pictures.
Among other things, approach #3 strongly suggests that the correct name for this property is cyclicity and not commutativity, because it clearly indicates that there is a generalization to cyclic symmetry for more than $2$ matrices: for example $\text{tr}(ABC) = \text{tr}(CAB)$ but is not in general equal to $\text{tr}(BAC)$.
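The distinction between cyclic rotation and arbitrary permutation is easy to see numerically (a sketch with random $3\times 3$ matrices):

```python
import numpy as np

# Cyclic rotations preserve the trace: tr(ABC) = tr(CAB) = tr(BCA).
# A non-cyclic swap such as tr(BAC) generally does not.
rng = np.random.default_rng(2)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
print(np.trace(A @ B @ C), np.trace(C @ A @ B))  # equal
print(np.trace(B @ A @ C))  # generically different
```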