No, the converse of Cayley-Hamilton is not true for $n\times n$ matrices with $n\gt 1$; in particular, it fails for $2\times 2$ matrices.
For a simple counterexample, notice that if $p(A)=0$, then for every multiple $q(x)$ of $p(x)$ you also have $q(A)=0$; so you would want to amend the converse to say "if $p(A)=0$, then $p(x)$ is a multiple of the characteristic polynomial of $A$". But even that amended version is false.
However, the only failures in the $2\times 2$ case are the scalar multiples of the identity. If $A=\lambda I$, then $p(x)=x-\lambda$ satisfies $p(A)=0$, but the characteristic polynomial is $(x-\lambda)^2$, not $p(x)$.
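For instance, with $\lambda=5$, the following small SymPy sketch (assuming SymPy is available; the matrix is just this scalar example) checks that $p(x)=x-5$ annihilates $A=5I$ even though the characteristic polynomial is $(x-5)^2$, of which $p(x)$ is not a multiple:

```python
import sympy as sp

x = sp.symbols('x')
A = 5 * sp.eye(2)                    # A = 5I, a scalar multiple of the identity

p_of_A = A - 5 * sp.eye(2)           # evaluate p(x) = x - 5 at A
char_poly = A.charpoly(x).as_expr()  # characteristic polynomial of A

print(p_of_A)     # the zero matrix, so p(A) = 0
print(char_poly)  # x**2 - 10*x + 25 = (x - 5)**2; p(x) = x - 5 is not a multiple of it
```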
For bigger matrices, there are other situations where even this weakened converse fails.
The concept that captures the "converse" of Cayley-Hamilton is the minimal polynomial of the matrix, which is the monic polynomial $p(x)$ of smallest degree such that $p(A)=0$. It is then easy to show (using the division algorithm) that if $q(x)$ is any polynomial for which $q(A)=0$, then $p(x)|q(x)$. (Be careful to justify that if $m(x)=r(x)s(x)$, then $m(A)=r(A)s(A)$; this is not immediate because matrix multiplication is not in general commutative!) So we have:
Theorem. Let $A$ be an $n\times n$ matrix over $\mathbf{F}$, and let $\mu(x)$ be the minimal polynomial of $A$. If $p(x)\in \mathbf{F}[x]$ is any polynomial such that $p(A)=0$, then $\mu(x)$ divides $p(x)$.
The Cayley-Hamilton Theorem shows that the characteristic polynomial is always a multiple of the minimal polynomial. In fact, one can prove that every irreducible factor of the characteristic polynomial must divide the minimal polynomial. Thus, for a $2\times 2$ matrix, if the characteristic polynomial splits and has distinct roots, then the characteristic and minimal polynomials are equal. If the characteristic polynomial is an irreducible quadratic and we are working over $\mathbb{R}$, then again the minimal and characteristic polynomials are equal. But if the characteristic polynomial is of the form $(x-a)^2$, then the minimal polynomial is either $(x-a)$ (when the matrix equals $aI$), or $(x-a)^2$ (when the matrix is not diagonalizable).
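Here is a SymPy sketch of the $(x-a)^2$ case with $a=3$ (the two matrices are arbitrary illustrations): for $3I$ the matrix $A-3I$ is already zero, so the minimal polynomial is $x-3$, while for the Jordan block one needs $(A-3I)^2$ to get zero, so the minimal polynomial is $(x-3)^2$; both have characteristic polynomial $(x-3)^2$.

```python
import sympy as sp

x = sp.symbols('x')
I2 = sp.eye(2)

A1 = 3 * I2                       # scalar matrix: minimal polynomial is x - 3
A2 = sp.Matrix([[3, 1],
                [0, 3]])          # Jordan block: minimal polynomial is (x - 3)**2

print(A1.charpoly(x).as_expr())   # x**2 - 6*x + 9 = (x - 3)**2
print(A2.charpoly(x).as_expr())   # same characteristic polynomial
print(A1 - 3 * I2)                # zero already at degree 1
print(A2 - 3 * I2)                # nonzero at degree 1 ...
print((A2 - 3 * I2) ** 2)         # ... but zero at degree 2
```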
As for solving this problem: if $\lambda$ is an eigenvalue of $A$, and $A$ is invertible, then $\lambda\neq 0$, and $\frac{1}{\lambda}$ is an eigenvalue of $A^{-1}$: for if $\mathbf{x}\neq\mathbf{0}$ is such that $A\mathbf{x}=\lambda\mathbf{x}$, then multiplying both sides by $A^{-1}$ we get $\mathbf{x} = A^{-1}(\lambda \mathbf{x}) = \lambda A^{-1}\mathbf{x}$. Dividing through by $\lambda$ shows $\mathbf{x}$ is an eigenvector of $A^{-1}$ corresponding to $\frac{1}{\lambda}$.
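(As a quick numerical sanity check of that reciprocal-eigenvalue fact, here is a NumPy sketch with an arbitrarily chosen invertible matrix:)

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])                       # arbitrary invertible 2x2 example

evals_A = np.linalg.eigvals(A)                   # eigenvalues of A: 2 and 3
evals_inv = np.linalg.eigvals(np.linalg.inv(A))  # eigenvalues of A^{-1}: 1/2 and 1/3

print(np.allclose(np.sort(1 / evals_A), np.sort(evals_inv)))  # True
```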
Since $A=A^{-1}$, that means that if $\lambda_1,\lambda_2$ are the eigenvalues of $A$, then $\lambda_1 = \frac{1}{\lambda_1}$ and $\lambda_2=\frac{1}{\lambda_2}$; thus, each eigenvalue is either $1$ or $-1$.
If the matrix is diagonalizable, then we cannot have both equal to $1$ (since then $A=I$), and they cannot both be equal to $-1$ (since $A\neq -I$), so one eigenvalue is $1$ and the other is $-1$. Since the trace of a square matrix equals the sum of its eigenvalues, the sum of the eigenvalues is $0$.
Why is $A$ diagonalizable? If it has two distinct eigenvalues, $1$ and $-1$, then there is nothing to do; we know it is diagonalizable. If it has a repeated eigenvalue, say $1$, but $A-I$ is not the zero matrix, pick $\mathbf{x}\in \mathbb{R}^2$ such that $A\mathbf{x}\neq \mathbf{x}$; then $$\mathbf{0}=(A-I)^2\mathbf{x} = (A^2-2A + I)\mathbf{x} = (2I-2A)\mathbf{x}$$
where the first equality holds by the Cayley-Hamilton Theorem (the characteristic polynomial is $(x-1)^2$) and the last because $A^2=I$. But that means that $2(I-A)\mathbf{x}=\mathbf{0}$, i.e. $A\mathbf{x}=\mathbf{x}$, contradicting our choice of $\mathbf{x}$. Thus, $A-I=0$, so $A=I$ and $A$ is diagonalizable. A similar argument shows that if $-1$ is the only eigenvalue, then $A+I=0$, so $A=-I$.
(Hiding behind that paragraph is the fact that if the minimal polynomial is squarefree and splits, then the matrix is diagonalizable; since $A^2=I$, the polynomial $p(x)=x^2-1=(x-1)(x+1)$ satisfies $p(A)=0$ and is therefore a multiple of the minimal polynomial, so the matrix must be diagonalizable.)
So this completes the proof that the trace must be $0$, given that $A\neq I$ and $A\neq -I$.
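To see the conclusion concretely, here is a small NumPy sketch (the conjugating matrix $P$ is an arbitrary invertible choice) that builds an involution other than $\pm I$ and checks that its trace vanishes:

```python
import numpy as np

P = np.array([[1.0, 2.0],
              [3.0, 4.0]])                       # any invertible matrix will do
A = P @ np.diag([1.0, -1.0]) @ np.linalg.inv(P)  # similar to diag(1, -1), so A = A^{-1}

print(np.allclose(A @ A, np.eye(2)))             # True: A is an involution
print(np.isclose(np.trace(A), 0.0))              # True: the trace is 0
```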
There are many, many ways to prove the Cayley-Hamilton theorem, and many of them have something to offer in the way of simplicity, generality, or of providing insight into why such a result could be expected to hold in the first place. For instance, the case of distinct eigenvalues (which you could prove directly) can be seen to suffice, once one realises that $p_{A}(A)=0\in\mathrm{Mat}_n(\mathbf C)$ is just a (tremendously complicated) system of $n^2$ polynomial equations in the entries of $A$, each of which will hold everywhere once it holds on the dense subset of matrices with $n$ distinct eigenvalues.
Personally I find the following approach leads most naturally to the Cayley-Hamilton theorem, i.e., it emerges quite logically from general considerations that are anyway central to eigenvalue problems, rather than being a claim coming out of the blue.
Certainly (by finite dimensionality of the matrix space) there exist nonzero polynomials $P$ with $P[A]=0$, and by a standard Euclidean division argument there is a polynomial $\mu_A$ such that this happens exactly when $P$ is a (polynomial) multiple of $\mu_A$, the minimal polynomial of $A$. An easy but crucial observation is $(*)$: for any polynomial $D$ dividing $\mu_A$, the minimal polynomial of the restriction of (the linear map defined by) $A$ to the subspace that is the image of $D[A]$ is given by $\mu_A/D$. To see this, for any polynomial$~Q$ the restriction of$~Q[A]$ to the image of $D[A]$ is zero if and only if $Q[A]\circ D[A]=(QD)[A]$ is zero on the whole space.
Every eigenvalue$~\lambda$ of$~A$ is a root of $\mu_A$, since applying $\mu_A(A)=0$ to an eigenvector multiplies it by $\mu_A(\lambda)$. From $(*)$ it follows that conversely every root$~\lambda$ of $\mu_A$ is an eigenvalue of$~A$ (the polynomial $D=X-\lambda$ divides $\mu_A$, so $D(A)=A-\lambda I$ is not surjective, as the restriction of$~A$ to its image has minimal polynomial $\mu_A/D\neq\mu_A$). If $m$ is the multiplicity of $X-\lambda$ as (irreducible) factor of $\mu_A$, then the kernel of $(A-\lambda I_n)^m$ is by definition the characteristic subspace (or generalised eigenspace) for$~\lambda$ of$~A$. The restriction of$~A$ to the image of $(A-\lambda I_n)^m$ has no eigenvalue $\lambda$ (its minimal polynomial $\mu_A/(X-\lambda)^m$ has no root$~\lambda$), so that image forms a direct sum with the characteristic subspace for $\lambda$; by the rank-nullity theorem they form a direct sum decomposition of the whole space. By induction on the number of eigenvalues, the space then decomposes as a direct sum of characteristic subspaces, provided $\mu_A$ splits into linear factors, as certainly happens over an algebraically closed field like the complex numbers.
All this is standard stuff about the minimal polynomial. Visibly it has the same roots as the characteristic polynomial $p_A$ (namely all eigenvalues), and it is natural to compare the multiplicities. The multiplicity of $\lambda$ as a root of $\mu_A$ is the number $m$ above, and its multiplicity as a root of $p_A$ is the dimension of the characteristic subspace for $\lambda$. By $(*)$, the powers $(A-\lambda I_n)^i$ for $i=0,1,\ldots,m$ must all have distinct (ever smaller) images, and therefore also all have distinct (ever larger) kernels, so the dimension of the largest of these kernels, the characteristic subspace for $\lambda$, is at least $m$. This means that (assuming an algebraically closed field) the characteristic polynomial is a multiple of $\mu_A$; thus $p_A(A)=0$, the Cayley-Hamilton theorem.
The algebraically closed field case can be seen to imply the case over arbitrary fields, since neither the characteristic polynomial, nor the minimal polynomial, nor the divisibility relation between them changes when the field is extended. If you don't like this flight into algebraically closed fields (in fact just a splitting field of the minimal polynomial will suffice), there is an alternative approach that works with cyclic subspaces (spanned by iterated images of a single nonzero vector) instead of eigenspaces. This leads to a basic case involving companion matrices, and an induction argument; see this answer.
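Independently of which proof one prefers, the statement itself is easy to test numerically. The sketch below (assuming NumPy; the $3\times 3$ matrix is an arbitrary example) evaluates the characteristic polynomial at $A$ by Horner's rule and finds a matrix that is zero up to floating-point error:

```python
import numpy as np

def charpoly_at_matrix(A):
    """Evaluate the characteristic polynomial of A at A itself, via Horner's rule."""
    coeffs = np.poly(A)          # monic coefficients of det(xI - A), highest degree first
    result = np.zeros_like(A)
    for c in coeffs:
        result = result @ A + c * np.eye(A.shape[0])
    return result

A = np.array([[1.0,  2.0, 0.0],
              [3.0, -1.0, 4.0],
              [0.0,  5.0, 2.0]])  # arbitrary example

print(np.max(np.abs(charpoly_at_matrix(A))))  # ~1e-13: p_A(A) = 0 up to rounding
```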
Brunton's comment is strange, even in context. He claims someone unspecified pointed out to him there may be exceptions, but settles for claiming "almost every" square matrix satisfies the theorem, as he didn't want to elaborate on edge cases. (This is unfortunate for anyone who hopes they can apply the theorem at some point.)
The comments have discussed the fact that matrices over non-commutative rings may be exceptions, but I don't think he had these in mind. If he did, his language should have been more careful, because "almost every" means the set of counterexamples should have measure $0$.
I actually think it's more likely that he and an unnamed colleague are data scientists and not linear algebra experts, leading to sloppiness on their part. What is true is that: