Putting $A$ in block form $(A_{ij})$, commutativity means $M_iA_{ij}=A_{ij}M_j$ for all $i$ and $j$, and you want to show that $i\neq j$ implies $A_{ij}=0$. Note that $M_iA_{ij}=A_{ij}M_j$ implies $p(M_i)A_{ij}=A_{ij}p(M_j)$ for every polynomial $p$. In particular, there is an $n$ such that $(M_j-\lambda_j I)^n=0$, so $(M_i-\lambda_j I)^nA_{ij}=0$. Since $M_i-\lambda_j I$ is invertible (when $i\neq j$), this implies that $A_{ij}=0$.
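If you want to see this argument in action, here is a small sanity check of my own (not part of the argument above), in Python with sympy: take $M_1,M_2$ to be Jordan blocks with the distinct eigenvalues $1$ and $2$ and confirm that a generic $2\times 2$ block $A_{12}$ satisfying $M_1A_{12}=A_{12}M_2$ must vanish.

```python
import sympy as sp

# Sketch: M1 and M2 are blocks with the distinct eigenvalues 1 and 2;
# a generic 2x2 block A12 with M1*A12 = A12*M2 is forced to be zero.
a, b, c, d = sp.symbols('a b c d')
M1 = sp.Matrix([[1, 1], [0, 1]])   # single eigenvalue 1
M2 = sp.Matrix([[2, 1], [0, 2]])   # single eigenvalue 2
A12 = sp.Matrix([[a, b], [c, d]])

print(sp.solve(list(M1*A12 - A12*M2), [a, b, c, d]))
# -> {a: 0, b: 0, c: 0, d: 0}

# The mechanism used above: (M2 - 2I)^2 = 0 while M1 - 2I is invertible.
I2 = sp.eye(2)
assert (M2 - 2*I2)**2 == sp.zeros(2, 2)
assert (M1 - 2*I2).det() != 0
```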
There's an approach to these problems that I largely attribute to Artin's Algebra: when deciding whether a claim is true or not, test it on some very simple representatives. Frequently you'll find a counterexample. (Alternatively, if the claim is true, you can often bootstrap a general proof from a simple one.) Diagonal matrices are of course the easiest, but beyond that:
(i.) upper triangular matrices are very easy to work with,
(ii.) Hermitian matrices are also very easy to work with and have properties you can easily exploit,
(iii.) block diagonal matrices are also quite nice.
This motivates a very useful technique: work out some $2\times 2$ and perhaps $3\times 3$ examples, then use them in simple block structures to create more complicated examples. In general, block diagonal matrices of the form
$A= \left[\begin{matrix}Y & \mathbf 0\\\mathbf 0 & Z\end{matrix}\right]$
where $Y$ and $Z$ are square, are extremely easy to work with. There are underlying interpretations in terms of direct sums (and direct products) that can be helpful. I trust you know that the characteristic polynomial splits along the blocks, i.e.
$\det\big(\lambda I -A\big)=\det\big(\lambda I -Y\big)\cdot \det\big(\lambda I -Z\big)$
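If you want to see this concretely, here is a quick check of my own (an illustration with sympy, using one $2\times 2$ block and one $3\times 3$ block of my choosing):

```python
import sympy as sp

lam = sp.symbols('lambda')
Y = sp.Matrix([[0, 1], [0, 0]])                    # a 2x2 block
Z = sp.Matrix([[1, 2, 0], [0, 3, 1], [0, 0, 5]])   # a 3x3 block
A = sp.diag(Y, Z)                                  # block diagonal, 5x5

# char. polynomial of A factors as the product of the blocks' char. polynomials
lhs = (lam*sp.eye(5) - A).det()
rhs = (lam*sp.eye(2) - Y).det() * (lam*sp.eye(3) - Z).det()
assert sp.expand(lhs - rhs) == 0
```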
(iv.) involutions and idempotents also have very nice properties, useful elsewhere.
For this post, recall that every nilpotent matrix is similar to a strictly upper triangular matrix, so use those as representatives in (2)-(4) (and even (5)).
The $2\times 2$ case is simplest, so start with that. Here we only have the representatives $\mathbf 0$ and $N = \left[\begin{matrix}0 & \alpha\\0 & 0\end{matrix}\right]$ for some $\alpha \neq 0$. These are obviously not similar; justification: (a) the conjugacy class of the zero matrix contains only the zero matrix, or (b) they have different ranks, so they can't be similar.
(1) requires some tinkering, though I'd attack it using either diagonal matrices or (iii).
(2) Consider re-using the $2\times 2$ case (and the trivial $1\times 1$ case) in the blocks $\left[\begin{matrix}\mathbf 0 & \mathbf 0\\\mathbf 0 & 0\end{matrix}\right]$ and $\left[\begin{matrix}N & \mathbf 0\\\mathbf 0 & 0\end{matrix}\right]$. The same rationale shows why they can't be similar.
(3) Since $4=2\cdot 2$, consider the $2\times 2$ case twice, via the block matrices $A:= \left[\begin{matrix}N & \mathbf 0\\\mathbf 0 & N\end{matrix}\right]$ and $A':= \left[\begin{matrix}N & \mathbf 0\\\mathbf 0 & \mathbf 0\end{matrix}\right]$.
$\text{rank } A =2 \gt 1 =\text{rank } A'$ so they aren't similar but they have the same minimal and characteristic polynomials.
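A quick verification of this claim (my own sketch with sympy, taking $\alpha = 1$; `Ap` stands for $A'$):

```python
import sympy as sp

N  = sp.Matrix([[0, 1], [0, 0]])
A  = sp.diag(N, N)                # the A above,  rank 2
Ap = sp.diag(N, sp.zeros(2, 2))   # A',           rank 1

lam = sp.symbols('lambda')
assert A.charpoly(lam).as_expr() == Ap.charpoly(lam).as_expr()   # both lambda**4
assert A.rank() == 2 and Ap.rank() == 1                          # so not similar
```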
Note: you said

"I had no idea how to quickly guess a matrix with a desired minimal or characteristic polynomial."
This is one of the advantages of the block diagonal structure. You learn the minimal (and characteristic) polynomials for simple (e.g. $2\times 2$) cases, then infer the result for some more complicated matrix living in a higher-dimensional vector space, i.e.
$p\big(A\big)= \left[\begin{matrix}p\big(N\big) & \mathbf 0\\\mathbf 0 & p\big(N\big)\end{matrix}\right]$
for every polynomial $p$, so it's immediate that $A$ has the same minimal polynomial as $N$. And you can check that $A'$ has the same minimal polynomial -- anything of lower degree wouldn't annihilate the copy of $N$ in its leading block.
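Checking the minimal polynomials directly is quick (again my own sketch with $\alpha = 1$; `Ap` stands for $A'$): $x$ annihilates neither matrix, but $x^2$ annihilates both, so each has minimal polynomial $x^2$.

```python
import sympy as sp

N  = sp.Matrix([[0, 1], [0, 0]])
A  = sp.diag(N, N)
Ap = sp.diag(N, sp.zeros(2, 2))

# x annihilates neither matrix, but x^2 annihilates both,
# so each has minimal polynomial x^2.
assert A != sp.zeros(4, 4) and Ap != sp.zeros(4, 4)
assert A**2 == sp.zeros(4, 4) and Ap**2 == sp.zeros(4, 4)
```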
(4) Make use of Hermiticity: $Z := N^*N$ is a product of two nilpotent matrices, but Hermitian matrices are diagonalizable and the only diagonalizable nilpotent matrix is the zero matrix; since $\text{rank}\big(N^*N\big)=\text{rank}\big(N\big)=1\gt 0$, the product $Z$ is nonzero and hence not nilpotent. There are many ways to finish, but the big idea is that you can convert an awkward matrix into a diagonalizable one via Hermiticity.
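A sketch of that computation (my own, with $\alpha = 1$; `N.H` is sympy's conjugate transpose):

```python
import sympy as sp

N = sp.Matrix([[0, 1], [0, 0]])
Z = N.H * N                      # product of two nilpotent matrices; here Z = [[0,0],[0,1]]

assert N**2 == sp.zeros(2, 2) and (N.H)**2 == sp.zeros(2, 2)   # both factors nilpotent
assert Z == Z.H                  # Z is Hermitian, hence diagonalizable
assert Z.rank() == 1             # nonzero, so Z cannot be nilpotent
```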
(5) A simple approach is to recognize that you have been getting hit with nilpotent matrices, so use them. Again with the $2\times 2$ case we have $N^2 = \mathbf 0$ but $N\neq \mathbf 0$, so $N$ kills its own image $N\cdot V \neq \{\mathbf 0\}$, which tells you something about $\ker N \cap \text{im } N$.
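Concretely (a sketch with $\alpha = 1$): the column space and null space of $N$ are the same line, so $\ker N \cap \text{im } N \neq \{\mathbf 0\}$.

```python
import sympy as sp

N = sp.Matrix([[0, 1], [0, 0]])
assert N**2 == sp.zeros(2, 2) and N != sp.zeros(2, 2)

print(N.columnspace())   # [Matrix([[1], [0]])] -- a basis of im N
print(N.nullspace())     # [Matrix([[1], [0]])] -- a basis of ker N: the same line
```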
Best Answer
Block-matrix multiplication is a very useful tool which, unfortunately, is rarely explained well. Here's the way I think about it.
Recall that if we choose a basis for a vector space $V$, any linear transformation has an associated matrix (whose entries are numbers) with respect to this choice of basis. Similarly, if we break a vector space into a direct sum $V = V_1 \oplus \cdots \oplus V_n$, then a block matrix specifies a linear transformation with respect to this decomposition into a direct sum.
Definitions of a direct sum of vector spaces vary, but I mean the following. We say that $V = V_1 \oplus \cdots \oplus V_n$ ($V$ is the direct sum of $V_1,\dots,V_n$) if $V$ is the vector space whose elements are the column-vectors $$ v = (v_1,\dots,v_n) = \pmatrix{v_1 \\ \hline v_2 \\ \hline \vdots \\ \hline v_n}, \quad v_i \in V_i, \ i = 1,\dots,n. $$ The horizontal lines in the second version are simply a notational convenience, used to emphasize the separation between the cells of our column vector. It will often be convenient to refer to an element of the form $(0,\dots,0,v,0,\dots,0)$ with $v \in V_i$. To that end, we will use the notation $v \otimes e_i:= (0,\dots,0,v,0,\dots,0)$, where $i$ is the index of the non-zero entry. So, we could write $$ (v_1,\dots,v_n) = v_1 \otimes e_1 + v_2 \otimes e_2 + \cdots + v_n \otimes e_n. $$ (This notation is ultimately motivated by the definition of the tensor product, but explaining this is not necessary for the purposes of this discussion.)
Now, suppose that $T:V \to W$ is a linear map, with $V = \bigoplus_{i=1}^n V_i$, $W = \bigoplus_{i=1}^m W_i$. For any $j$, the map $v \mapsto T(v \otimes e_j)$ is a linear map from $V_j$ to $W$. With that established, we see that there must exist linear maps $T_{1j},\dots,T_{mj}$ such that $$ T(v \otimes e_j) = (T_{1j}(v),T_{2j}(v),\dots,T_{mj}(v)). $$ With the maps $T_{ij}$ defined in this fashion, we see that, by linearity, we have $$ T(v_1,\dots,v_n) = \pmatrix{T_{11}(v_1) + \cdots + T_{1n}(v_n)\\ \vdots \\ T_{m1}(v_1) + \cdots + T_{mn}(v_n)} := \pmatrix{T_{11} & \cdots & T_{1n}\\ \vdots & \ddots & \vdots \\ T_{m1} & \cdots & T_{mn}} \pmatrix{v_1\\ \vdots \\ v_n}. $$ On the right side above, we have introduced a new notation: a matrix whose entries are linear transformations. By performing the indicated "matrix multiplication", we obtain the desired result, namely $T(v_1,\dots,v_n)$. With that in mind, we say that the block-matrix representation of $T$ (with respect to the decompositions of $V,W$) is given by $$ T = \pmatrix{ T_{11} & \cdots & T_{1n}\\ \vdots & \ddots & \vdots\\ T_{m1} & \cdots & T_{mn} } = \left( \begin{array}{c|c|c} T_{11} & \cdots & T_{1n}\\ \hline \vdots & \ddots & \vdots\\ \hline T_{m1} & \cdots & T_{mn} \end{array} \right). $$ Now, suppose that we have maps $T:V \to W$ and $S: W \to Z$, where $V,W$ are as above and $Z = Z_1 \oplus \cdots \oplus Z_p$. We can now see that for any vector $v = (v_1,\dots,v_n) \in V$, we have $$ (S \circ T)(v) = S(Tv) = S \left[ \pmatrix{T_{11} & \cdots & T_{1n}\\ \vdots & \ddots & \vdots \\ T_{m1} & \cdots & T_{mn}} \pmatrix{v_1\\ \vdots \\ v_n} \right]\\ = \pmatrix{S_{11} & \cdots & S_{1m}\\ \vdots & \ddots & \vdots \\ S_{p1} & \cdots & S_{pm}} \left[ \pmatrix{T_{11} & \cdots & T_{1n}\\ \vdots & \ddots & \vdots \\ T_{m1} & \cdots & T_{mn}} \pmatrix{v_1\\ \vdots \\ v_n} \right] \\ = \left[\pmatrix{S_{11} & \cdots & S_{1m}\\ \vdots & \ddots & \vdots \\ S_{p1} & \cdots & S_{pm}} \pmatrix{T_{11} & \cdots & T_{1n}\\ \vdots & \ddots & \vdots \\ T_{m1} & \cdots & T_{mn}} \right] \pmatrix{v_1\\ \vdots \\ v_n}, $$ where we define the product of two block matrices in the "expected way". On the other hand, $$ (S \circ T)(v) = \pmatrix{(S \circ T)_{11} & \cdots & (S \circ T)_{1n}\\ \vdots & \ddots & \vdots \\ (S \circ T)_{p1} & \cdots & (S \circ T)_{pn}}\pmatrix{v_1\\ \vdots \\ v_n}. $$ So, it must be the case that multiplying the block-matrix of $S$ and the block-matrix of $T$ gives us the block-matrix of $S \circ T$.
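(Not part of the argument, but if you want to convince yourself numerically: the sketch below, in Python with numpy and randomly chosen blocks of sizes I made up, checks that the "expected way" block product agrees with the ordinary product of the assembled matrices, i.e. with the matrix of the composition.)

```python
import numpy as np

# Random blocks for V = R^2 (+) R^3, W = R^1 (+) R^2, Z = R^2 (+) R^2.
rng = np.random.default_rng(0)
T = [[rng.standard_normal((1, 2)), rng.standard_normal((1, 3))],
     [rng.standard_normal((2, 2)), rng.standard_normal((2, 3))]]
S = [[rng.standard_normal((2, 1)), rng.standard_normal((2, 2))],
     [rng.standard_normal((2, 1)), rng.standard_normal((2, 2))]]

# block product computed the "expected way": (S T)_{ij} = sum_k S_{ik} T_{kj}
ST = [[S[i][0] @ T[0][j] + S[i][1] @ T[1][j] for j in range(2)] for i in range(2)]

# it agrees with the ordinary product of the assembled matrices
assert np.allclose(np.block(ST), np.block(S) @ np.block(T))
```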
For your particular example, $M$ represents a map from $\Bbb R^{k + d} = \Bbb R^k \oplus \Bbb R^d$ to $\Bbb R^n$. The block-matrix of $M$ is given by $[A\ \ B]$, which is to say that $$ M \pmatrix{v_1\\v_2} = \pmatrix{A & B} \pmatrix{v_1\\v_2} = Av_1 + Bv_2. $$ On the other hand, the block matrix of $M'$ is $\left[\begin{smallmatrix}A'\\B'\end{smallmatrix}\right]$, which is to say that for $w \in \Bbb R^n$ we have $$ M'w = \pmatrix{A'\\B'}w = \pmatrix{A'w\\ B'w}. $$ With that established, the block matrix of $M' \circ M$ is the product of the block matrices, $$ M'M = \pmatrix{A'\\B'} \pmatrix{A & B} = \pmatrix{A'A & A'B\\ B'A & B'B}. $$ Indeed, if you agree with my description of the maps corresponding to $M$ and $M'$, then for $v = (v_1,v_2)$ we have $$ M'M v = M'(Mv) = M' (Av_1 + Bv_2) = \pmatrix{A'(Av_1 + Bv_2)\\ B'(Av_1 + Bv_2)}\\ = \pmatrix{A'A v_1 + A'B v_2\\ B'A v_1 + B'B v_2}. $$
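Here is the same check specialized to your example (my own sketch, with made-up sizes $k=2$, $d=3$, $n=4$ and random $A,B,A',B'$; `Ap`, `Bp` stand for $A'$, $B'$):

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 2, 3, 4
A,  B  = rng.standard_normal((n, k)), rng.standard_normal((n, d))
Ap, Bp = rng.standard_normal((k, n)), rng.standard_normal((d, n))

M  = np.block([A, B])        # maps R^k (+) R^d -> R^n
Mp = np.block([[Ap], [Bp]])  # maps R^n -> R^k (+) R^d

# M'M has the block form [[A'A, A'B], [B'A, B'B]]
assert np.allclose(Mp @ M, np.block([[Ap @ A, Ap @ B],
                                     [Bp @ A, Bp @ B]]))
```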