This answer assumes the matrices are taken over $\mathbb C$.
Yes, the statement is still true even if the matrix isn't diagonalizable.
For the proof you saw, it is enough that $D$ can be taken to be an upper triangular matrix (and it always can; this is Schur's Decomposition Theorem). This suffices because the diagonal entries of such a $D$ are exactly the eigenvalues of the original matrix.
The Jordan Canonical Form would also work, but Schur's Decomposition is the weaker (and more easily proved) statement.
For completeness I'll add the proofs here.
Let $n\in \mathbb N$ and $A\in \mathcal M_n(\mathbb C)$. Let $\lambda _1, \ldots ,\lambda _n$ be the eigenvalues of $A$. The characteristic polynomial of $A$ is $p_A(z)=(z-\lambda _1)\cdots (z-\lambda _n)$.
Schur's Decomposition guarantees the existence of an invertible matrix $P$ and an upper triangular matrix $U$ such that $A=PUP^{-1}$ and $U$'s diagonal entries are exactly $\lambda _1, \ldots ,\lambda _n$.
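For the skeptical reader, here is a quick numerical sanity check (an illustration, not part of the proof) using `scipy.linalg.schur`; the random matrix is just an arbitrary example, and the complex Schur form gives a unitary $P$, so $P^{-1}=P^H$:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# complex Schur form: A = P U P^{-1} with P unitary and U upper triangular
U, P = schur(A, output='complex')
print(np.allclose(A, P @ U @ P.conj().T))  # reconstruction holds

# the diagonal of U carries the eigenvalues of A
print(np.allclose(np.sort_complex(np.diag(U)),
                  np.sort_complex(np.linalg.eigvals(A))))
```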
Since similarity preserves the characteristic polynomial, it follows that the characteristic polynomial of $U$ is $p_U(z)=(z-\lambda _1)\cdots (z-\lambda _n)$; therefore $U$ and $A$ have the same eigenvalues with the same algebraic multiplicities.
From the fact that $U$'s diagonal entries are $\lambda _1, \ldots ,\lambda _n$ it follows that the trace of $U$ is the sum of the eigenvalues of $A$ and the determinant of $U$ is the product of the eigenvalues of $A$.
Trace properties yield the following $$\text{tr}(A)=\text{tr}\left(PUP^{-1}\right)=\text{tr}\left(UP^{-1}P\right)=\text{tr}(U),$$ thus proving that the sum of the eigenvalues of $A$ equals $\text{tr}(A)$.
Similarly for the determinant it holds that $$\det(A)=\det\left(PUP^{-1}\right)=\det\left(P\right)\det\left(U\right)\det\left(P^{-1}\right)=\det(U),$$
hence the product of the eigenvalues of $A$ equals the determinant of $A$.
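Both identities are easy to check numerically; here is a minimal sketch (the random matrix is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
eigenvalues = np.linalg.eigvals(A)

print(np.isclose(np.trace(A), eigenvalues.sum()))        # sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigenvalues.prod()))  # product of eigenvalues
```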
The cyclicity property is exactly the statement that $f(AB)=f(BA)$, where $f$ is the trace, the determinant, or some other similarity invariant. In practice all such cyclic $f$ are functions of the coefficients of the characteristic polynomial, and that polynomial is the same for $AB$ and $BA$.
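A quick numerical sketch of this last claim (the random matrices are an arbitrary choice, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# np.poly(X) returns the coefficients of the characteristic polynomial of X
print(np.allclose(np.poly(A @ B), np.poly(B @ A)))   # same characteristic polynomial
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # in particular, same trace
```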
From multilinear algebra we know that the coefficients of the characteristic polynomial of an operator $X$ acting on a vector space $V$ are (up to a $\pm$ sign) the traces of the operators that $X$ induces on the exterior powers $\Lambda^i(V)$. The top exterior power, with $i = \dim V$, is one-dimensional and gives the determinant. Operators on $V$ induce operators on this one-dimensional space, and the latter must commute with each other. This is why, for $f = \det$, $f(A_1 \cdots A_n)$ is invariant under all permutations: the actions of the $A_i$ on $\Lambda^{\dim V}(V)$ commute with each other. For the other coefficients, such as the trace, $\Lambda^i(V)$ has dimension greater than one, and linear operators on these exterior powers do not always commute, so only cyclic permutations are guaranteed to preserve $f$.
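To make the exterior-power statement concrete: the trace of the operator induced by $X$ on $\Lambda^k(V)$ equals the sum of the $k\times k$ principal minors of $X$, which up to sign is a coefficient of the characteristic polynomial. Here is a hedged sketch checking this; the helper `trace_on_exterior_power` is a name invented for this illustration:

```python
import itertools
import numpy as np

def trace_on_exterior_power(X, k):
    """Trace of the operator induced by X on the k-th exterior power,
    computed as the sum of all k-by-k principal minors of X."""
    n = X.shape[0]
    return sum(np.linalg.det(X[np.ix_(idx, idx)])
               for idx in itertools.combinations(range(n), k))

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4))

# np.poly gives z^n + c_1 z^{n-1} + ... with c_k = (-1)^k e_k, where
# e_k is exactly the trace of X acting on the k-th exterior power
coeffs = np.poly(X)
for k in range(1, 5):
    print(np.isclose((-1) ** k * coeffs[k], trace_on_exterior_power(X, k)))
```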
It would be nice to know a more concrete combinatorial description of this commutativity, that does not use exterior powers. This is probably the same as the problem of seeing that permutation sign is multiplicative, without using determinants of permutation matrices.
For symmetric matrices $M$, there is the following geometric picture: the trace is proportional to the average of the quadratic form $\mathbf{v}^T M \mathbf{v}$ (the squared norm with respect to the inner product defined by $M$) over all vectors on the unit sphere:
$$\operatorname{tr}(M) \propto \int_{\|\mathbf{v}\|=1} \mathbf{v}^T M \mathbf{v}\,dA,$$ where the proportionality constant depends only on the dimension of $M$. You can prove this formula by noticing that the integral is linear in $M$, so it suffices to break $M$ up into a sum of matrices each containing exactly one nonzero entry: when $M = e_i e_j^T$ with $i\neq j$, the integral vanishes by symmetry, and when $M = e_i e_i^T$, symmetry of the sphere gives the same value for every $i$, which fixes the proportionality constant.
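A Monte Carlo sketch of this formula: the average of $\mathbf{v}^T M \mathbf{v}$ over the unit sphere in $\mathbb R^n$ comes out to $\operatorname{tr}(M)/n$. The sampling below assumes the standard trick that normalized Gaussian vectors are uniform on the sphere; the particular $M$ is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
B = rng.standard_normal((n, n))
M = B + B.T  # a symmetric matrix

# uniform samples on the unit sphere: normalized Gaussian vectors
V = rng.standard_normal((200_000, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# average of v^T M v over the sphere vs. tr(M)/n
average = np.einsum('ij,jk,ik->i', V, M, V).mean()
print(average, np.trace(M) / n)  # should agree to about two decimals
```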
Now for any rotation $R$, $\operatorname{tr}(R^TMR) = \operatorname{tr}(M)$, since conjugation by $R$ simply "spins the unit sphere." So you might as well take $R$ to be the matrix diagonalizing $M$; the integrand then becomes $\sum_i \lambda_i v_i^2$, and the formula exhibits $\operatorname{tr}(M)$ as the sum of the eigenvalues.