This has undoubtedly been answered (likely multiple times) here before, so I post this at the risk of beating a dead (and decaying) horse.
Let me first link you to this page, which contains two excellent answers (I particularly recommend Keith Conrad's expository paper linked in Pierre-Yves Gaillard's answer). However, let me provide a perhaps more elementary viewpoint since, from experience, many people beginning this topic are not quite comfortable with minimal polynomial based arguments yet.
You seem to have covered part a quite adequately so let me focus on part b. I apologize in advance for the length, but I feel that this is a topic which requires thorough understanding.
The main thing to remember about commuting matrices is the fact that commuting matrices respect each other's eigenspaces. What does this mean? To talk about that, we first have to introduce the topic of an invariant subspace.
Consider a matrix mapping $A:\ V \rightarrow V$ for a vector space $V$. If there is some subspace $U$ of $V$ such that the restriction of $A$ to $U$ remains an operator in the sense that $A:\ U\rightarrow U$, then we say that $U$ is an invariant subspace of $A$. The term stable is also sometimes used. The significance of this is that $A(U) \subseteq U$, the image of $U$ is entirely contained within $U$. This way, it makes sense to talk about a restriction of the mapping to the smaller vector space $U$.
This is desirable for several reasons, the main one being that linear mappings on smaller vector spaces are easier to analyze. We can look at the action of the mapping on each invariant subspace and then piece them together to get an overall picture. This is what diagonalization does; we break down the vector space into smaller invariant subspaces, the eigenspaces, and then piece together the facts to get a simpler picture of how the mapping works. Many of the simpler, canonical representations are dependent on this fact (for example, the Jordan canonical form looks at the invariant generalized eigenspaces).
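To make the notion concrete, here is a minimal numerical sketch (the specific matrix is a hypothetical example, not from the discussion above): a block upper-triangular matrix leaves the span of the first two standard basis vectors invariant.

```python
import numpy as np

# Hypothetical 3x3 example: the zero entries below the 2x2 upper-left block
# mean U = span{e1, e2} satisfies A(U) ⊆ U, i.e. U is an invariant subspace.
A = np.array([[2.0, 1.0, 5.0],
              [0.0, 3.0, 7.0],
              [0.0, 0.0, 4.0]])

for u in (np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])):
    image = A @ u
    # The third coordinate of A u stays zero, so A u remains inside U.
    assert image[2] == 0.0
```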
Now, if we have two commuting, diagonalizable matrices, then each eigenspace of $B$ is not only invariant under $B$ itself, but also under $A$. This is what we mean by preserving each other's eigenspaces. To see this, let $\mathbf{v}$ be an eigenvector of $B$ under eigenvalue $\lambda$. Then
$$B(A\mathbf{v}) = A(B\mathbf{v}) = \lambda A\mathbf{v}$$
so that $A\mathbf{v}$ again lies in the eigenspace of $B$ for $\lambda$ (it is either zero or an eigenvector of $B$ with eigenvalue $\lambda$). In our new language, the eigenspace $E_\lambda$ of $B$ is invariant under $A$, so it makes sense to look at the restriction of $A$ to $E_\lambda$.
Now consider the restriction of $A$ to $E_\lambda$. If all the eigenvalues of $B$ are simple (multiplicity one) then that means each eigenspace of $B$ is one dimensional. We have therefore restricted $A:\ E_\lambda \rightarrow E_\lambda$ to a mapping on a one-dimensional vector space. But this means that $A$ must take each vector of $E_\lambda$ to a scalar multiple of itself. You can check that this necessarily implies that $E_\lambda$ is also an eigenspace of $A$. Therefore, for any eigenbasis of $B$ that we take, the corresponding vectors also form an eigenbasis of $A$. This means that the two matrices are simultaneously diagonalizable; they share a common eigenbasis.
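As a numerical sanity check of this claim, here is a hypothetical pair of commuting matrices built from a shared basis $S$ (an illustration, not a proof):

```python
import numpy as np

# Diagonal matrices conjugated by the same invertible S always commute;
# B is chosen with simple (distinct) eigenvalues 1 and 2.
S = np.array([[1.0, 1.0], [0.0, 1.0]])
S_inv = np.linalg.inv(S)
A = S @ np.diag([5.0, 7.0]) @ S_inv
B = S @ np.diag([1.0, 2.0]) @ S_inv

assert np.allclose(A @ B, B @ A)  # they commute

# Every eigenvector of B is automatically an eigenvector of A:
eigvals, eigvecs = np.linalg.eig(B)
for i in range(2):
    v = eigvecs[:, i]
    Av = A @ v
    # A v is parallel to v, so the 2x2 matrix [v | Av] is singular.
    assert abs(np.linalg.det(np.column_stack([v, Av]))) < 1e-10
```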
The general case is a bit more involved in that the restrictions to the invariant subspaces are more complex (they're no longer one-dimensional), but the ideas are identical.
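The general case can be sketched algorithmically: compute the eigenspaces of $B$, restrict $A$ to each, and diagonalize the restrictions. A minimal numpy sketch, assuming for simplicity that $A$ and $B$ are real symmetric commuting matrices (so `eigh` applies and the eigenspace bases can be taken orthonormal); the function name is my own:

```python
import numpy as np

def simultaneous_eigenbasis(A, B, tol=1e-9):
    """Return an orthogonal S such that S^T A S and S^T B S are diagonal,
    assuming A and B are commuting real symmetric matrices."""
    mu, V = np.linalg.eigh(B)          # eigenvalues of B come back sorted
    cols = []
    i, n = 0, len(mu)
    while i < n:
        # Group columns of V belonging to (numerically) the same eigenvalue.
        j = i
        while j < n and abs(mu[j] - mu[i]) < tol:
            j += 1
        E = V[:, i:j]                  # orthonormal basis of the eigenspace
        A_restricted = E.T @ A @ E     # restriction of A to that eigenspace
        _, W = np.linalg.eigh(A_restricted)
        cols.append(E @ W)             # eigenvectors of A inside the eigenspace
        i = j
    return np.hstack(cols)
```

The inner `eigh` call is exactly the "analyze the restriction" step of the argument: within each eigenspace of $B$, any basis works for $B$, so we are free to pick one adapted to $A$.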
P.S. Since you seem to be interested in physics, let me mention a crucial application of commuting operators. In quantum mechanics, you have quantities called observables, each of which is, roughly speaking, represented by a Hermitian matrix. Unlike in classical physics, different observables need not be simultaneously measurable (by measuring position, for example, you cannot simultaneously measure momentum, and vice versa), which is ultimately due to the fact that the position operator and the momentum operator do not commute (this is the underlying reason behind the uncertainty principle). They do not have a shared basis which can represent the states of a system. Commuting operators therefore form a key element of quantum physics in that they define quantities which are compatible, i.e. simultaneously measurable.
As proven in this post, the idea goes as follows: take $W$ a $B$-invariant subspace. Now, since $B$ is diagonalizable with eigenvalues $\mu_1, \dots, \mu_k$,
$$
\mathbb{k}^n = E_{\mu_1} \oplus \cdots \oplus E_{\mu_k}
$$
It suffices to see that $W = (W\cap E_{\mu_1}) \oplus \cdots \oplus ( W\cap E_{\mu_k})$, in which case one can form a basis of $W$ from bases of each $W \cap E_{\mu_i}$, and that basis will consist of eigenvectors of $B$ because each piece is contained in $E_{\mu_i}$. In effect, let us prove both inclusions: the immediate one is $(W\cap E_{\mu_1}) \oplus \cdots \oplus ( W\cap E_{\mu_k})\subseteq W$, since each summand is contained in $W$ and $W$ is a subspace.
As for the other, since $W = W \cap \mathbb{k}^n = W \cap \bigoplus_{i=1}^k E_{\mu_i}$, any element $w$ of $W$ is a sum of eigenvectors,
$$w = e_1 + \dots + e_l$$
with $e_i$ an eigenvector of eigenvalue $\mu_{j_i}$ and the $\mu_{j_i}$ pairwise distinct. Therefore, it is sufficient to show that if $e_1 + \dots + e_l \in W$, then $e_1, \dots, e_l \in W$. We proceed by induction on $l$. If $l = 1$, then $e_1 = w \in W$. If $l >1$, since
$$
Bw - \mu_{j_1}w = (\mu_{j_1} - \mu_{j_1})e_1 + \dots + (\mu_{j_l} - \mu_{j_1})e_l \in W
$$
and $\mu_{j_i} - \mu_{j_1} \neq 0$ for $i > 1$, this is a sum of $l-1$ eigenvectors lying in $W$. By the inductive hypothesis each $(\mu_{j_i} - \mu_{j_1})e_i \in W$, hence $e_i \in W$ for $i >1$, and so finally $e_1 = w - e_2 - \dots - e_l \in W$, completing the proof.
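The extraction step in this induction, applying $B - \mu_{j_1} I$ to kill the first eigencomponent, can be checked on a toy example (a hypothetical diagonal $B$; illustration only):

```python
import numpy as np

# B diagonal with distinct eigenvalues; w is a sum of two eigenvectors.
B = np.diag([1.0, 2.0, 3.0])
e1 = np.array([1.0, 0.0, 0.0])   # eigenvector with eigenvalue mu_1 = 1
e2 = np.array([0.0, 1.0, 0.0])   # eigenvector with eigenvalue mu_2 = 2
w = e1 + e2

# Applying (B - mu_1 I) annihilates the e1 component and rescales e2,
# leaving a shorter sum of eigenvectors, as in the induction.
extracted = B @ w - 1.0 * w
assert np.allclose(extracted, (2.0 - 1.0) * e2)
```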
Best Answer
$M$ can be diagonalized iff the minimal polynomial $m$ for $M$ splits completely into linear, non-repeated factors $m(\lambda)=(\lambda-\lambda_1)(\lambda-\lambda_2)\cdots(\lambda-\lambda_N)$. The usual proof of this involves the unique polynomials $p_{k}$ of degree $N-1$ such that $p_{k}(\lambda_{j})=\delta_{j,k}$. Then $\sum_{k=1}^{N}p_{k}\equiv 1$ because the sum is a polynomial of degree at most $N-1$ which equals $1$ in $N$ places. Therefore, $$ I = p_1(M)+p_2(M)+\cdots+p_N(M). $$ Furthermore, $p_j(M)p_k(M)=0$ for $j \ne k$ because $m$ divides $p_j p_k$ for $j \ne k$. Therefore each $p_j(M)$ is a projection matrix; to see this, apply $p_j(M)$ to the above identity: $$ p_j(M)=p_j(M)^{2}. $$ Furthermore, $(M-\lambda_k I)p_k(M)=0$, which implies $$ M = \lambda_1 p_1(M)+\lambda_2 p_2(M)+\cdots+ \lambda_N p_N(M). $$

If $M_1,M_2,M_3,\cdots,M_J$ are commuting diagonalizable matrices, you can perform the above construction for each $M_j$ in order to obtain eigenvalues $\lambda_{j,1},\lambda_{j,2},\cdots,\lambda_{j,K_{j}}$ and polynomials $p_{j,1},p_{j,2},\cdots,p_{j,K_j}$ for each $1 \le j \le J$. Because the $M_j$ commute, the same is true of all of the $p_{j,k}(M_j)$. Now form all of the distinct products $$ P_{k_1,k_2,\cdots,k_J}=p_{1,k_1}(M_1)p_{2,k_2}(M_2)\cdots p_{J,k_J}(M_J). $$ The sum of all such products is $I$, and every such $P$ is a projection. Discard the products that turn out to be $0$. Because the order of the factors may be rearranged without changing $P$, it follows that $$ (M_{j}-\lambda_{j,k_{j}}I)P_{k_1,k_2,\cdots,k_J}=0,\;\;\; 1 \le j \le J. $$

So there are non-zero projections $Q_{1},Q_{2},\cdots,Q_{m}$ whose sum is $I$, whose products are $0$ for distinct factors, and such that every $M_{j}$ acts as a scalar multiple of the identity on the range of a given $Q_{k}$. Choose a basis for each of the ranges of the $Q_{j}$. Combining these bases produces a basis with respect to which each $M_j$ has a diagonal representation.
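The construction of the projections $p_k(M)$ can be illustrated numerically. A minimal sketch, assuming a small symmetric $M$ with known distinct eigenvalues; the helper name is my own:

```python
import numpy as np

def lagrange_projections(M, eigenvalues):
    """Spectral projections p_k(M), where p_k is the Lagrange polynomial
    satisfying p_k(lambda_j) = delta_{jk}, for distinct eigenvalues of M."""
    n = M.shape[0]
    projections = []
    for k, lam_k in enumerate(eigenvalues):
        P = np.eye(n)
        for j, lam_j in enumerate(eigenvalues):
            if j != k:
                P = P @ (M - lam_j * np.eye(n)) / (lam_k - lam_j)
        projections.append(P)
    return projections

M = np.array([[2.0, 1.0], [1.0, 2.0]])        # eigenvalues 1 and 3
P1, P3 = lagrange_projections(M, [1.0, 3.0])

assert np.allclose(P1 + P3, np.eye(2))          # the projections sum to I
assert np.allclose(P1 @ P1, P1)                 # each is idempotent
assert np.allclose(P1 @ P3, np.zeros((2, 2)))   # mutually annihilating
assert np.allclose(M, 1.0 * P1 + 3.0 * P3)      # spectral decomposition of M
```

The four assertions check exactly the four identities derived above: the partition of the identity, idempotence, mutual annihilation, and the spectral decomposition of $M$.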