Nakayama Lemma. Let $R$ be a commutative ring, $N$ a finitely generated $R$-module, and $J\subseteq R$ a subset closed under addition and multiplication such that $JN=N$. Then there is $a\in J$ such that $(1+a)N=0$. (Here $JN$ denotes the set of finite linear combinations of elements of $N$ with coefficients in $J$.)
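(A toy instance of the conclusion: take $R=\mathbb{Z}$, $J=2\mathbb{Z}$, $N=\mathbb{Z}/3\mathbb{Z}$. Then $JN=N$, since multiplication by $2$ is surjective on $\mathbb{Z}/3\mathbb{Z}$, and indeed $a=2\in J$ satisfies $(1+a)N=3N=0$.)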
Cayley-Hamilton Theorem. Let $A$ be a commutative ring, $I$ an ideal of $A$, $M$ a finitely generated $A$-module, $\phi$ an $A$-module endomorphism of $M$ such that $\phi(M)\subseteq IM$. Then there are $n\ge 1$ and $a_i\in I^i$ such that $\phi^n+a_1\phi^{n-1}+\dots+a_n=0$.
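As a quick sanity check of the statement (not of the proof below), here is a minimal SymPy computation for a made-up instance: $A=\mathbb{Z}$, $I=(2)$, $M=\mathbb{Z}^2$, and $\phi$ given by a matrix whose entries lie in $I$; the specific matrix is arbitrary.

```python
# Sanity check: A = ZZ, I = (2), M = ZZ^2, and phi(M) ⊆ IM since every
# entry of the matrix below lies in I. Cayley-Hamilton predicts
# phi^2 + a1*phi + a2 = 0 with a1 in I and a2 in I^2.
from sympy import Matrix, eye, zeros

phi = Matrix([[2, 4], [6, 2]])   # arbitrary matrix with entries in I = (2)
a1 = -phi.trace()                # a1 = -4, lies in I
a2 = phi.det()                   # a2 = -20, lies in I^2 = (4)
assert a1 % 2 == 0 and a2 % 4 == 0
assert phi**2 + a1*phi + a2*eye(2) == zeros(2, 2)
```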
Nakayama Lemma implies Cayley-Hamilton Theorem:
$M$ is an $A[X]$-module via $Xm=\phi(m)$; in particular, $M$ is a finitely generated $A[X]$-module, and by hypothesis $XM\subseteq IM$. Now consider the ring $A[X,X^{-1}]$, the localization of $A[X]$ at the multiplicative set $S$ generated by $X$, and the finitely generated $A[X,X^{-1}]$-module $S^{-1}M$ (which we denote by $M[X^{-1}]$). The set $$J=\{a_1X^{-1}+\cdots+a_{r}X^{-r}:a_i\in I^i,\ r\ge1\}$$ is closed under addition and multiplication, and moreover $JM[X^{-1}]=M[X^{-1}]$: given $m\in M$, since $XM\subseteq IM$ we have $Xm\in IM$, say $Xm=b_1m_1+\cdots+b_km_k$ with $b_j\in I$, and therefore $m=(b_1X^{-1})m_1+\cdots+(b_kX^{-1})m_k\in JM[X^{-1}]$.
Now by the Nakayama Lemma (applied with $R=A[X,X^{-1}]$ and $N=M[X^{-1}]$) there are $p\ge 1$ and $a_i\in I^i$ such that $(1+a_1X^{-1}+\cdots+a_{p}X^{-p})M[X^{-1}]=0$. In particular, $(1+a_1X^{-1}+\cdots+a_{p}X^{-p})M=0$, that is, $\dfrac{(X^p+a_1X^{p-1}+\cdots+a_p)m}{X^p}=0$ for all $m\in M$. A fraction in $M[X^{-1}]$ vanishes if and only if its numerator is killed by some power of $X$; since $M$ is finitely generated, a single exponent works for all generators, so there is $s\ge0$ such that $X^s(X^p+a_1X^{p-1}+\cdots+a_p)M=0$. Now set $n=s+p$ and conclude that $\phi^{n}+a_1\phi^{n-1}+\cdots+a_n=0$ with $a_i\in I^i$.
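Spelled out, the last translation is just the statement that $X$ acts on $M$ as $\phi$:
$$X^{s}\bigl(X^{p}+a_1X^{p-1}+\cdots+a_p\bigr)M=0 \quad\Longleftrightarrow\quad \phi^{s+p}+a_1\phi^{s+p-1}+\cdots+a_p\phi^{s}=0 \ \text{ in } \operatorname{End}_A(M),$$
which is the desired identity for $n=s+p$ after setting $a_{p+1}=\cdots=a_{n}=0$ (and $0\in I^i$ for every $i$).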
You can apply the following powerful idea: think of Cayley-Hamilton as a statement about the "universal matrix," the one whose entries are indeterminates $x_{ij}$ living in a polynomial ring $\mathbb{Z}[x_{ij}]$. The statement is that $P(X) = 0$, where $P$ is the characteristic polynomial of $X$ (a polynomial in one variable whose coefficients are polynomials in the $x_{ij}$), so this statement itself is, for $n \times n$ matrices, a collection of $n^2$ polynomial identities in $n^2$ variables over $\mathbb{Z}$ (one for each entry of $P(X)$), or equivalently a collection of $n^2$ polynomials that you would like to vanish. Now:
Claim: Let $f(y_1, \dots, y_k)$ be a polynomial in any number of variables over $\mathbb{Z}$. The following are equivalent:
- $f$ is identically zero (in the sense that all of its coefficients are zero).
- $f(y_1, \dots, y_k) = 0$ for every choice of elements $y_i$ of every commutative ring.
- $f(y_1, \dots, y_k) = 0$ for every choice of elements $y_i$ of a fixed infinite field $K$ of characteristic zero.
Proof. The implications $1 \Rightarrow 2 \Rightarrow 3$ are immediate from the definitions, so it remains to prove $3 \Rightarrow 1$. This can be done by induction on $k$: for $k = 1$ it reduces to the observation that a nonzero polynomial over a field has only finitely many roots (here we use that $K$ is infinite), and the inductive step proceeds by fixing some of the variables and varying the others. We can also appeal to the combinatorial Nullstellensatz. $\Box$
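Condition 1 is also the one a computer algebra system checks directly: expand over $\mathbb{Z}$ and see that every coefficient cancels. A trivial SymPy illustration (the identity chosen here is arbitrary):

```python
# Certify a polynomial identity over every commutative ring by expanding
# over ZZ and checking that all coefficients vanish (condition 1 above).
from sympy import symbols, expand

y1, y2 = symbols('y1 y2')
f = (y1 + y2)**2 - y1**2 - 2*y1*y2 - y2**2
assert expand(f) == 0   # identically zero, hence zero in any commutative ring
```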
Now we can prove Cayley-Hamilton over every commutative ring by proving it over any infinite field of characteristic zero (it is crucial here that the polynomials involved have integer coefficients, which is what gives us this freedom). In particular we can work over an algebraically closed field, where the proof can be organized as follows; a symbolic check of the $n = 2$ universal case appears after the list.
- As you already observed, Cayley-Hamilton is easy to prove for diagonalizable matrices.
- Now your second observation, in geometric terms, says that the diagonalizable matrices are Zariski dense in all matrices: any polynomial vanishing on the diagonalizable matrices must vanish identically. This follows from two facts. First, the matrices with distinct eigenvalues (which are diagonalizable over an algebraically closed field) form a nonempty Zariski open set, since their complement is cut out by the vanishing of the discriminant of the characteristic polynomial (which is itself a polynomial in the entries). Second, in any irreducible variety (meaning the ring of polynomial functions is an integral domain; you use this property crucially), nonempty Zariski open sets are Zariski dense (this is essentially what you prove).
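To make the "universal matrix" point of view concrete, here is a SymPy check of the $n = 2$ case: the entries of $X$ are indeterminates, so verifying $P(X) = 0$ symbolically proves Cayley-Hamilton for $2 \times 2$ matrices over every commutative ring at once. (This is a verification for $n = 2$, not a proof for general $n$.)

```python
# Cayley-Hamilton for the universal 2x2 matrix over ZZ[x11, x12, x21, x22].
from sympy import Matrix, Poly, symbols, eye, zeros, expand

t = symbols('t')
X = Matrix(2, 2, symbols('x11 x12 x21 x22'))
p = Poly((t*eye(2) - X).det(), t)       # characteristic polynomial of X
coeffs = p.all_coeffs()                 # [1, -tr(X), det(X)]
P = zeros(2, 2)
for i, c in enumerate(coeffs):          # P(X) = X^2 - tr(X) X + det(X) I
    P += c * X**(len(coeffs) - 1 - i)
assert P.applyfunc(expand) == zeros(2, 2)   # P(X) vanishes identically
```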
Lots of other results about matrices can be proven this way. For example:
Exercise: Let $A, B$ be $n \times n$ matrices. Then $AB$ and $BA$ have the same characteristic polynomial.
Proof. The statement that $\det(tI - AB) = \det(tI - BA)$ is a collection of $n$ polynomial identities in the $2n^2$ variables $a_{ij}, b_{ij}$ (the entries of the "universal pair of matrices"), or equivalently a single polynomial identity in $2n^2 + 1$ variables, so as above, to prove it over every commutative ring it suffices to prove it over a fixed infinite field. The statement is clearly true if, say, $A$ is invertible, since then $AB$ and $BA$ are conjugate ($BA = A^{-1}(AB)A$), and now we use the fact that the invertible matrices are Zariski open (defined by the nonvanishing of the determinant), hence Zariski dense, in all matrices. (It's also possible to avoid the Zariski topology by working over $\mathbb{R}$ or $\mathbb{C}$ with the Euclidean topology and showing that invertible matrices are dense in the usual sense there.)
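Again one can let a computer do the universal verification for small $n$; here is the $n = 2$ case in SymPy (a check, not the general proof):

```python
# det(tI - AB) = det(tI - BA) for universal 2x2 matrices over ZZ[a_ij, b_ij].
from sympy import Matrix, symbols, eye, expand

t = symbols('t')
A = Matrix(2, 2, symbols('a11 a12 a21 a22'))
B = Matrix(2, 2, symbols('b11 b12 b21 b22'))
lhs = (t*eye(2) - A*B).det()
rhs = (t*eye(2) - B*A).det()
assert expand(lhs - rhs) == 0   # the two characteristic polynomials agree
```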
Here is a cleaner algebraic reformulation of the proof, working just over the universal ring $\mathbb{Z}[a_{ij}, b_{ij}]$. Observe that
$$\det(A) \det(tI - BA) = \det(tA - ABA) = \det(tI - AB) \det(A)$$
and now use the fact that $\mathbb{Z}[a_{ij}, b_{ij}]$ is an integral domain (so, geometrically, its spectrum is an irreducible affine scheme) in which $\det(A)$ is nonzero (the determinant of the universal matrix is not the zero polynomial), so we can cancel $\det(A)$ from both sides. $\Box$
Best Answer
My previous answer made a false claim -- that we wanted to view $M$ as an $M_{n\times n}(R[x])$-module. In fact, this will not work in general: while given an endomorphism $\phi\in\operatorname{End}_R(M)$ and a generating set $\{m_1,\dots, m_n\}$ of $M$ we may produce a matrix $A_\phi\in M_{n\times n}(R)$ such that $$\require{AMScd} \begin{CD} R^n @>A_\phi>> R^n \\ @V\pi VV @VV\pi V\\ M @>>\phi > M \end{CD} $$ commutes, it is not the case that an arbitrary matrix $B\in M_{n\times n}(R)$ induces a well-defined endomorphism of $M.$ However, this doesn't mean we can't use the main idea that $(xI - A)^{\textrm{adj}}(xI - A) = \det(xI - A)I\in M_{n\times n}(R[x])$; we just need to be careful.
First, let's choose our generating set $\{m_1,\dots, m_n\}$ of $M$ and our matrix representation $A_\phi = (a_{ij})$ of $\phi$ with respect to this generating set. Explicitly, we have some collection of constants $a_{ij}\in R$ such that $$ \phi(m_i) = \sum_{j=1}^n a_{ij} m_j. $$ If we let $\delta_{ij} = \begin{cases} 1,\quad i = j\\ 0,\quad i\neq j\end{cases}$ and we consider $M$ as an $R[x]$-module where $x$ acts on $M$ by $xm = \phi(m),$ then the previous equation is equivalent to $$ \sum_{j}(x\delta_{ij} - a_{ij})m_j = 0. $$
Observe that if we assemble the coefficients of the $m_j$ as we range over all $j$ and all $i$ into a matrix, we obtain $$(x\delta_{ij} - a_{ij})_{ij} = xI - A_\phi.$$ Now we apply the adjugate trick. Write $(xI - A_\phi)^{\textrm{adj}} = (b_{ij})_{ij}.$ Then the fact that $(xI - A_\phi)^{\textrm{adj}}(xI - A_\phi) = \det(xI - A_\phi) I$ means that $$ \sum_{k=1}^n b_{ik}(x\delta_{kj} - a_{kj}) = \det(xI - A_\phi)\delta_{ij}. $$ Taking our equation $0 = \sum_{j}(x\delta_{kj} - a_{kj})m_j$ and multiplying by $b_{ik},$ we have $$ 0 = \sum_j b_{ik}(x\delta_{kj} - a_{kj})m_j. $$ Next we sum these equations over $k$: \begin{align*} 0 &= \sum_{k=1}^n\sum_{j=1}^n b_{ik}(x\delta_{kj} - a_{kj})m_j\\ &=\sum_{j=1}^n\sum_{k=1}^n b_{ik}(x\delta_{kj} - a_{kj})m_j\\ &= \sum_{j=1}^n\det(xI - A_\phi)\delta_{ij} m_j\\ &= \det(xI - A_\phi)m_i. \end{align*} This holds for every $i,$ so $p(x) := \det(xI - A_\phi)$ acts on $M$ as zero; i.e., $p(\phi) : M\to M$ is the zero map.
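If it helps to see the adjugate identity in action, here is a SymPy check for a generic $2\times 2$ matrix (SymPy's `Matrix.adjugate` computes the classical adjoint used above):

```python
# adj(xI - A) * (xI - A) = det(xI - A) * I for a generic 2x2 matrix.
from sympy import Matrix, symbols, eye, expand

x = symbols('x')
A = Matrix(2, 2, symbols('a11 a12 a21 a22'))
B = x*eye(2) - A
lhs = (B.adjugate() * B).applyfunc(expand)
rhs = (B.det() * eye(2)).applyfunc(expand)
assert lhs == rhs
```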