The reason we need the lemma is that from $P(t)=b(t)(A-tI)$ one cannot directly conclude that $P(A)=b(A)(A-AI)$.
If $R$ is a commutative ring, then there is a natural map $R[t]\to R^R$ which is a ring homomorphism (we endow $R^R$ with the pointwise ring structure: $(f+g)(r) = f(r)+g(r)$ and $(fg)(r) = f(r)g(r)$ for every $r\in R$). If $p(t)=q(t)s(t)$, then for every $r\in R$ you have $p(r)=q(r)s(r)$.
But this doesn't work if $R$ is not commutative. For example, taking $p(t) = at$, $q(t) = t$ and $s(t)=a$, you have $p(t)=q(t)s(t)$ in $R[t]$ (since $t$ is central in $R[t]$ even when $R$ is not commutative), but $p(r) = ar$ while $q(r)s(r) = ra$. So you get $p(r)=q(r)s(r)$ if and only if $a$ and $r$ commute. Thus, while you can certainly define a map $\psi\colon R[t]\to R^R$ by
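To see the failure concretely, here is a quick check (a minimal sketch using SymPy; the particular matrices $a$ and $r$ are arbitrary choices of mine) contrasting the commutative case with the matrix case:

```python
import sympy as sp

# Commutative case: evaluating p = q*s at a number respects the product.
x = sp.symbols('x')
q, s = x**2 + 1, 3*x - 2
assert (q * s).subs(x, 5) == q.subs(x, 5) * s.subs(x, 5)

# Noncommutative case: with p(t) = a*t = t*a in R[t], evaluation at r
# gives p(r) = a*r, while q(r)*s(r) = r*a, and the two can differ.
a = sp.Matrix([[0, 1], [0, 0]])
r = sp.Matrix([[1, 0], [0, 2]])
print(a * r)   # Matrix([[0, 2], [0, 0]])
print(r * a)   # Matrix([[0, 1], [0, 0]])
```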
$$\psi(a_0+a_1t+\cdots+a_nt^n)(r) = a_0 + a_1r + \cdots + a_nr^n,$$
this map is not a ring homomorphism when the ring is not commutative. This is the situation we have here, where the ring $R$ is the ring of $n\times n$ matrices over $\mathbb{K}$, which is not commutative when $n\gt 1$. In particular, from $P(t) = b(t)(A-tI)$ one cannot simply conclude that $P(A)=b(A)(A-AI)$. Doing so implicitly assumes that your map $M_n(\mathbb{K})[t]\to M_n(\mathbb{K})^{M_n(\mathbb{K})}$ is multiplicative, which it is not in this case.
If your $A$ happens to be central in $M_n(\mathbb{K})$, then it is true that the induced map $M_n(\mathbb{K})[t]\to M_n(\mathbb{K})$ is a homomorphism. But then you would be assuming that your $A$ is a scalar multiple of the identity. It would also be true if the coefficients of the polynomial $b(t)$ centralize $A$, but you are not assuming that. So you do need to prove that in this case you have $P(A)=b(A)(A-AI)$, since it does not follow from the general set-up (the way it would in a commutative setting).
P.S. In fact, this is the subtle point where the proof that a polynomial of degree $n$ over a field has at most $n$ roots breaks down for skew fields/division rings. If $K$ is a division ring, then the division algorithm holds for polynomials with coefficients in $K$, so one can show that for every $p(t)\in K[t]$ and $a(t)\in K[t]$, $a(t)\neq 0$, there exist unique $q(t)$ and $r(t)$ such that $p(t)=q(t)a(t) + r(t)$ and $r(t)=0$ or $\deg(r)\lt \deg(a)$. From this, we can deduce that for every polynomial $p(t)$ and every $a\in K$, we can write $p(t) = q(t)(t-a) + r$, where $r\in K$. But the proof of the Remainder and Factor Theorems no longer goes through, because we cannot go from $p(t)=q(t)(t-a)+r$ to $p(a)=q(a)(a-a)+r$; and you cannot get the recursion argument to work, because from $p(t)=q(t)(t-a)$ and $p(b)=0$ with $b\neq a$, you cannot deduce that $q(b)=0$. For instance, over the real quaternions we have $p(t)=t^2+1=(t+i)(t-i)$, and $p(j)=j^2+1=0$, yet $(j+i)(j-i) = ij-ji = 2k \neq 0$. I remember that when I first learned the corresponding theorems for polynomial rings, the professor challenged us to identify all the field axioms used in the proofs of the Remainder and Factor Theorems; none of us spotted the use of commutativity in the evaluation map.
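For what it's worth, the quaternion computation above can be checked mechanically (a small sketch using SymPy's Quaternion class):

```python
from sympy.algebras.quaternion import Quaternion

# Quaternion(a, b, c, d) represents a + b*i + c*j + d*k.
i = Quaternion(0, 1, 0, 0)
j = Quaternion(0, 0, 1, 0)
one = Quaternion(1, 0, 0, 0)

# p(t) = t^2 + 1 evaluated at j is zero...
print(j * j + one)        # 0 + 0*i + 0*j + 0*k

# ...but multiplying the evaluated factors gives (j+i)(j-i) = 2k != 0.
print((j + i) * (j - i))  # 0 + 0*i + 0*j + 2*k
```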
"The" proof of the Cayley-Hamilton Theorem involves invariant subspaces, or subspaces that are mapped onto themselves by a linear operator. If $T$ is a linear operator on a vector space $V$, then a subspace $W\subseteq V$ is called a $T$-invariant subspace of $V$ if $T(W)\subseteq W$, i.e. if $T(v)\in W$ for every $v\in W$. Some examples of $T$-invariant subspaces you might be familiar with are $\{0\}, N(T), R(T), V$, and $E_\lambda$ for any eigenvalue $\lambda$ of $T$. For a linear operator $T$ and any nonzero $x\in V$, then the subspace
$$ W=\textrm{span}(\{x,T(x),T^2(x),\dots\})$$
is called the $T$-cyclic subspace of $V$ generated by $x$, and one can show that $W$ is the smallest $T$-invariant subspace containing $x$. Cyclic subspaces can be used to establish the Cayley-Hamilton Theorem. Indeed, a $T$-invariant subspace allows us to define a new linear operator on that subspace: the restriction $T_W$ of $T$ to $W$ is a linear operator from $W$ to $W$. These two operators are linked in the sense that the characteristic polynomial of $T_W$ divides the characteristic polynomial of $T$. You can show this by choosing an ordered basis for $W$, extending it to an ordered basis for $V$, and taking the matrix representations of $T$ and $T_W$; computing the characteristic polynomial of $T$ then exhibits the characteristic polynomial of $T_W$ as a factor, as the display below shows.
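In outline: if $\gamma$ is an ordered basis for $W$, extended to an ordered basis $\beta$ for $V$, then because $W$ is $T$-invariant the matrix of $T$ is block upper triangular,
$$[T]_\beta = \begin{pmatrix} [T_W]_\gamma & B_2 \\ O & B_3 \end{pmatrix},$$
so that
$$\det([T]_\beta - tI) = \det([T_W]_\gamma - tI)\,\det(B_3 - tI),$$
and the characteristic polynomial of $T_W$ appears as a factor of that of $T$.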
The last tool we will need is a way to compute the characteristic polynomial of the restriction $T_W$ itself. Cyclic subspaces are useful here precisely because the characteristic polynomial of the restriction of a linear operator $T$ to a cyclic subspace can be computed explicitly. In fact, if $T$ is a linear operator on a finite-dimensional vector space $V$ and $W$ is the $T$-cyclic subspace of $V$ generated by a nonzero $v\in V$, with $k=\textrm{dim}(W)$, then (a computational check follows the list):
- $\{v,T(v),T^2(v),\dots,T^{k-1}(v)\}$ is a basis for $W$
- If $a_0v+a_1T(v)+\cdots+a_{k-1}T^{k-1}(v)+T^k(v)=0$, then the characteristic polynomial of $T_W$ is $f(t)=(-1)^k(a_0+a_1t+\cdots+a_{k-1}t^{k-1}+t^k)$
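Here is the promised computational check of this theorem (a sketch using SymPy; the matrix $A$, chosen as a companion matrix, and the generator $v=e_1$ are arbitrary choices of mine):

```python
import sympy as sp

t = sp.symbols('t')

# T(x) = A x on Q^3, with generating vector v = e1.
A = sp.Matrix([[0, 0, -6],
               [1, 0, -11],
               [0, 1, -6]])
v = sp.Matrix([1, 0, 0])

# The Krylov vectors v, T(v), T^2(v) are independent, so W = V and k = 3.
K = sp.Matrix.hstack(v, A * v, A**2 * v)
assert K.rank() == 3

# Solve a0*v + a1*T(v) + a2*T^2(v) + T^3(v) = 0 for the scalars a_i.
a0, a1, a2 = K.solve(-(A**3) * v)

# The theorem predicts the characteristic polynomial of T_W (= T here):
f = (-1)**3 * (a0 + a1 * t + a2 * t**2 + t**3)
print(sp.expand(f))             # -t**3 - 6*t**2 - 11*t - 6
print(A.charpoly(t).as_expr())  # t**3 + 6*t**2 + 11*t + 6
```

The two printed polynomials differ only by the factor $(-1)^k$, since SymPy's `charpoly` uses the convention $\det(tI-A)$ rather than $\det(A-tI)$.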
I will omit the proof of the above theorem unless requested, since the main goal is the proof of the Cayley-Hamilton Theorem, which states:
Let $T$ be a linear operator on a finite-dimensional vector space $V$,
and let $f(t)$ be the characteristic polynomial of $T$. Then
$f(T)=T_0$, the zero transformation. That is, $T$ "satisfies" its
characteristic equation.
Proof: We want to show that $f(T)(v)=0$ for all $v\in V$. If $v=0$, we are done since $f(T)$ is linear, so suppose $v\neq 0$, and let $W$ be the $T$-cyclic subspace generated by $v$, with dimension $k$. By the theorem above, there exist scalars $a_0,\dots,a_{k-1}$ such that
$$a_0v+a_1T(v)+\cdots+a_{k-1}T^{k-1}(v)+T^k(v)=0 $$
and the characteristic polynomial for $T_W$ is:
$$ g(t)=(-1)^k(a_0+a_1t+\cdots+a_{k-1}t^{k-1}+t^k)$$
Combining these two equations yields:
$$g(T)(v)=(-1)^k(a_0I+a_1T+\cdots+a_{k-1}T^{k-1}+T^k)(v)=0 $$
We know that $g(t)$ divides the characteristic polynomial $f(t)$ of $T$, so there exists a polynomial $q(t)$ such that $f(t)=q(t)g(t)$. Note that $f$, $g$, and $q$ have scalar coefficients, and scalars commute with $T$, so evaluation at $T$ is multiplicative here (in contrast with the matrix-coefficient situation discussed above). Thus:
$$ f(T)(v)=q(T)g(T)(v)=q(T)(g(T)(v))=q(T)(0)=0$$
The Cayley-Hamilton Theorem for Matrices is then a corollary to the Cayley-Hamilton Theorem stated above.
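As a sanity check of the matrix form, one can evaluate the characteristic polynomial at a sample matrix (a sketch using SymPy; the matrix is an arbitrary choice of mine):

```python
import sympy as sp

t = sp.symbols('t')

A = sp.Matrix([[2, 1, 0],
               [0, 3, 1],
               [1, 0, 1]])

# Evaluate f at A by Horner's method; by Cayley-Hamilton the result
# should be the zero matrix (the overall sign convention for f is
# immaterial, since -0 = 0).
f = A.charpoly(t)
F = sp.zeros(3, 3)
for c in f.all_coeffs():   # leading coefficient first
    F = F * A + c * sp.eye(3)
print(F)                   # Matrix([[0, 0, 0], [0, 0, 0], [0, 0, 0]])
```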
Here is, I think, a possible answer.
From Jacobi's formula, it follows that
$$ - \frac{d}{d \lambda} C_A(\lambda) = \frac{d}{d \lambda} \det ( \lambda I - A) = \text{tr} \left( \text{adj} ( \lambda I - A ) \frac{d}{d \lambda} ( \lambda I - A) \right) = \text{tr} ( \text{adj} ( \lambda I - A ) ) $$
Therefore,
$$ \frac{d}{d \lambda} C_A(\lambda) \Bigg|_{\lambda=0} = \text{tr} ( \text{adj} ( A ) )$$
Observe that
$$ \frac{d}{d \lambda} C_A(\lambda) \Bigg|_{\lambda=0} = (-1)^{n+1} \lim_{\lambda \rightarrow 0} \frac{C_A(\lambda) - C_A(0)}{\lambda} = (-1)^{n+1} \lim_{\lambda \rightarrow 0} \Gamma_A(\lambda) = (-1)^{n+1} \Gamma_A(0)$$
Therefore,
$$ (-1)^{n+1} \Gamma_A(0) = \text{tr} ( \text{adj} ( A ) ) $$
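Since the signs attached to $C_A$ and $\Gamma_A$ depend on conventions fixed in the original question and not restated here, the step that can be checked independently is the Jacobi-formula step itself (a sketch using SymPy; the matrix is an arbitrary choice of mine):

```python
import sympy as sp

lam = sp.symbols('lambda')

A = sp.Matrix([[1, 2, 0],
               [0, 1, 3],
               [4, 0, 1]])
M = lam * sp.eye(3) - A

# Jacobi's formula: d/d(lambda) det(lambda*I - A) = tr(adj(lambda*I - A)),
# using the fact that d/d(lambda) (lambda*I - A) = I.
lhs = sp.diff(M.det(), lam)
rhs = M.adjugate().trace()
print(sp.simplify(lhs - rhs))  # 0
```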