$
\def\o{{\tt1}}\def\p{\partial}
\def\L{\left}\def\R{\right}
\def\LR#1{\L(#1\R)}
\def\BR#1{\Big(#1\Big)}
\def\bR#1{\big(#1\big)}
\def\vecc#1{\operatorname{vec}\LR{#1}}
\def\diag#1{\operatorname{diag}\LR{#1}}
\def\Sym#1{\operatorname{Sym}\!\BR{#1}}
\def\sym#1{\operatorname{Sym}\LR{#1}}
\def\Diag#1{\operatorname{Diag}\LR{#1}}
\def\trace#1{\operatorname{Tr}\LR{#1}}
\def\qiq{\quad\implies\quad}
\def\grad#1#2{\frac{\p #1}{\p #2}}
\def\m#1{\left[\begin{array}{r}#1\end{array}\right]}
\def\c#1{\color{red}{#1}}
$First, define some new variables
$$\eqalign{
B &= \tfrac 12D \\
y_k &= Mx_k \\
X &= \m{x_1&x_2&\ldots&x_n}\;\in{\mathbb R}^{m\times n} \\
Y &= \m{y_1&y_2&\ldots&y_n}\;= MX \qiq \c{dY = dM\,X} \\
J &=X\oslash X \qiq J_{ij}=\o\quad\big({\rm all\!-\!ones\;matrix}\big) \\
X &= J\odot X \\
{\cal D}_Y
&= {\rm Diag}\BR{\!\vecc{Y}}\,\in{\mathbb R}^{mn\times mn} \\
}$$
where $\{\odot,\oslash\}$ denote elementwise/Hadamard multiplication and division.
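As a quick sanity check of these definitions, here is a minimal numpy sketch; the dimensions and the random $M,X$ are placeholders of my own choosing, with $X$ shifted away from zero so that $X\oslash X$ is well defined.
```python
import numpy as np

m, n = 4, 6                             # placeholder dimensions
rng = np.random.default_rng(0)

M = rng.standard_normal((m, m))
X = rng.standard_normal((m, n)) + 3.0   # shift keeps entries nonzero, so X/X is defined
Y = M @ X                               # columns are y_k = M x_k

J = X / X                               # J = X ⊘ X, the all-ones matrix
assert np.allclose(J, np.ones((m, n)))
assert np.allclose(J * X, X)            # X = J ⊙ X

D_Y = np.diag(Y.flatten(order="F"))     # Diag(vec(Y)), column-major vec
assert D_Y.shape == (m * n, m * n)
```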
Then rewrite the problem in pure matrix notation
and calculate its differential.
$$\eqalign{
2B &= (X\odot Y)^TJ + J^T(X\odot Y) - X^TY - Y^TX \\
&= 2\,\Sym{J^T(X\odot{Y}) - X^T{Y}} \\
dB &= \Sym{J^T(X\odot{\c{dY}}) - X^T{\c{dY}}} \\
&= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T{\c{dM\,X}}} \\
}$$
where $\;\sym{A} \doteq \tfrac 12\LR{A+A^T}\,$ denotes the symmetric part of its argument.
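The symmetrized form is easy to verify numerically; a minimal numpy sketch with placeholder data:
```python
import numpy as np

def Sym(A):
    """Symmetric part: (A + A^T) / 2."""
    return 0.5 * (A + A.T)

m, n = 4, 6
rng = np.random.default_rng(1)
M = rng.standard_normal((m, m))
X = rng.standard_normal((m, n))
Y = M @ X
J = np.ones((m, n))

# Four-term expression for 2B vs. its symmetrized rewriting
two_B_long = (X * Y).T @ J + J.T @ (X * Y) - X.T @ Y - Y.T @ X
two_B_sym  = 2.0 * Sym(J.T @ (X * Y) - X.T @ Y)
assert np.allclose(two_B_long, two_B_sym)
```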
At this point, one can calculate a component-wise gradient
$$\eqalign{
\grad{B}{M_{ij}}
&= \Sym{J^T(X\odot\LR{E_{ij}\,X}) - X^T{E_{ij}\,X}} \\
}$$
where $E_{ij}$ is a single-entry matrix whose $(i,j)$ element is $\o$ and all others are zero.
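As a sanity check, this component-wise gradient can be compared against a finite difference; a minimal numpy sketch with placeholder data and an arbitrary index pair $(i,j)$ follows (since $B$ is linear in $M$, the central difference is exact up to round-off):
```python
import numpy as np

def Sym(A):
    return 0.5 * (A + A.T)

def B_of(M, X, J):
    Y = M @ X
    return Sym(J.T @ (X * Y) - X.T @ Y)

m, n = 4, 6
rng = np.random.default_rng(2)
M = rng.standard_normal((m, m))
X = rng.standard_normal((m, n))
J = np.ones((m, n))

i, j, h = 1, 3, 1e-6
E = np.zeros((m, m)); E[i, j] = 1.0     # single-entry matrix E_ij

analytic = Sym(J.T @ (X * (E @ X)) - X.T @ (E @ X))
numeric  = (B_of(M + h * E, X, J) - B_of(M - h * E, X, J)) / (2 * h)
assert np.allclose(analytic, numeric, atol=1e-6)
```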
In terms of the original variable
$$\eqalign{
\grad{B}{M_{ij}} &= \frac 12 \LR{\grad{D}{M_{ij}}} \\
}$$
The problem with the requested gradient $\LR{\grad DM}$ is that it's a fourth-order tensor, which cannot be written in standard matrix notation.
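It can, however, be flattened: stacking $\vecc{\grad{B}{M_{ij}}}$ as the columns of an $n^2\times m^2$ matrix gives $\grad{\vecc{B}}{\vecc{M}^T}$ in ordinary matrix form. A sketch of that bookkeeping (the name `G` and the loop layout are my own choices; note the gradient depends only on $X$ and $J$, since $B$ is linear in $M$):
```python
import numpy as np

def Sym(A):
    return 0.5 * (A + A.T)

m, n = 4, 6
rng = np.random.default_rng(3)
X = rng.standard_normal((m, n))
J = np.ones((m, n))

# Column (i,j) of G holds vec(dB/dM_ij); vec is column-major stacking
G = np.zeros((n * n, m * m))
for j in range(m):
    for i in range(m):
        E = np.zeros((m, m)); E[i, j] = 1.0
        dB = Sym(J.T @ (X * (E @ X)) - X.T @ (E @ X))
        G[:, j * m + i] = dB.flatten(order="F")
```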
Update
You've added a constraint to your original question, i.e.
$$M=A^TA \qiq \c{dM=dA^TA+A^TdA}$$
The procedure to find the new gradient is straightforward. Write the differential, change the independent variable from $M\to A$, then isolate the gradient.
Here is the calculation for the component-wise gradient
$$\eqalign{
dB &= \Sym{J^T(X\odot\LR{\c{dM\,X}}) - X^T\LR{\c{dM\,X}}} \\
&= \Sym{J^T(X\odot\LR{\c{dA^TA\,X+A^TdA\,X}})
    - X^T\LR{\c{dA^TA\,X+A^TdA\,X}}} \\
&= 2\,\Sym{J^T\bR{\!\LR{AX}\odot\LR{dA\,X}\!} - X^TA^TdA\,X} \\
\grad{B}{A_{ij}}
&= 2\,\Sym{J^T\bR{\!\LR{AX}\odot\LR{E_{ij}X}\!}-X^TA^TE_{ij}X} \\
}$$
The third line uses the fact that $J$ is the all-ones matrix, so $\bR{J^T(U\odot V)}_{ij}=\sum_k U_{kj}V_{kj}\,$; writing out the sums shows that
$$J^T\bR{X\odot\LR{dA^TA\,X}} \;=\; J^T\bR{X\odot\LR{A^TdA\,X}} \;=\; J^T\bR{\LR{AX}\odot\LR{dA\,X}}$$
while the two $X^T(\cdot)$ terms are transposes of each other and combine under $\Sym{\cdot}$.
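This result can also be checked against finite differences; a sketch assuming a square $A$ (any $A$ with $m$ columns would do) and placeholder data (since $B$ is quadratic in $A$, the central difference is again exact up to round-off):
```python
import numpy as np

def Sym(A):
    return 0.5 * (A + A.T)

def B_of(A, X, J):
    Y = (A.T @ A) @ X                   # M = A^T A
    return Sym(J.T @ (X * Y) - X.T @ Y)

m, n = 4, 6
rng = np.random.default_rng(4)
A = rng.standard_normal((m, m))         # square A for simplicity
X = rng.standard_normal((m, n))
J = np.ones((m, n))

i, j, h = 2, 0, 1e-6
E = np.zeros((m, m)); E[i, j] = 1.0     # single-entry matrix E_ij

analytic = 2.0 * Sym(J.T @ ((A @ X) * (E @ X)) - X.T @ A.T @ (E @ X))
numeric  = (B_of(A + h * E, X, J) - B_of(A - h * E, X, J)) / (2 * h)
assert np.allclose(analytic, numeric, atol=1e-5)
```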
The linked paper is pretty good when compared to most
Machine Learning papers, but the mathematics is still far
too "hand-wavy". It's geared towards an audience who just want to code the gradient expression directly into their Python programs without bothering to check the math.
Deriving their gradient result using matrix notation would be a herculean effort. Even converting their result into matrix notation would be very difficult, because the nested sum violates the usual summation convention since it contains an expression with four $i$-indices and four $j$-indices. Conventional index notation only permits two of any single index.
If you want someone to decode that mess for you then, as Steph suggested, you should post a new question $-$ I'll let him answer that one.
In the proof that you have written, only the properties of the inner product and the norm are used, and these hold in any Hilbert space. So, as you have already pointed out, the proof carries over directly to a Hilbert space.
When the data is not linearly separable, a suitable kernel function $K(\cdot, \cdot)$ is used to map the data into a Hilbert space in which linear separability is more likely to hold. The convergence proofs use only the properties of norms and inner products, and hence apply in any Hilbert space.
Note that the Euclidean space $\mathbb{R}^d$ is a finite-dimensional Hilbert space. The principal difference between $\mathbb{R}^d$ and an arbitrary Hilbert space, as far as the validity of proofs is concerned, is compactness: in $\mathbb{R}^d$ every closed and bounded set is compact, but this fails in an infinite-dimensional Hilbert space. As long as a proof or derivation in $\mathbb{R}^d$ does not use properties of compact sets, it carries over directly to a Hilbert space.
Thanks to this, kernel SVM has already been extended to the case where the observations are elements of an infinite-dimensional Hilbert space, e.g., random functions. For example, see the papers here.
Best Answer
Different authors use the term "kernel" differently. In the machine learning community, some similarity measures are called kernels, such as $$k(x,y) = \begin{cases} 1 & \text{if } \|x-y\| \le \epsilon \\ 0 & \text{else} \end{cases}$$ which is popular in spectral graph clustering. However, choosing the data set $\cal X = \{0,1,2\}$ and $\epsilon = 1$ yields a kernel matrix with a negative eigenvalue, so this kernel is not PSD.
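That counterexample is easy to confirm numerically, e.g. with numpy:
```python
import numpy as np

# Kernel matrix of k on X = {0, 1, 2} with eps = 1
x = np.array([0.0, 1.0, 2.0])
K = (np.abs(x[:, None] - x[None, :]) <= 1.0).astype(float)

print(np.linalg.eigvalsh(K))   # [1 - sqrt(2), 1, 1 + sqrt(2)]; one eigenvalue is negative
```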
Mathematicians, on the other hand, tend to use "kernel" as a synonym for a PSD kernel. Mercer's theorem guarantees that then, and only then, there exists a Hilbert space $\cal H$ whose inner product is described by $k$, in the sense that there exists a map $\phi$ such that $$\forall x,y \in \cal X: \langle \phi(x) , \phi(y) \rangle_{\cal H} = k(x,y)$$
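For contrast, here is a quick illustration (my own example, not one from the question) with the Gaussian kernel $k(x,y)=e^{-(x-y)^2}$, which is PSD: every kernel matrix it produces has non-negative eigenvalues, up to round-off.
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(50)                       # arbitrary sample points
K = np.exp(-(x[:, None] - x[None, :]) ** 2)       # Gaussian kernel matrix

assert np.linalg.eigvalsh(K).min() > -1e-10       # non-negative up to round-off
```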
Now to your kernel: It is not PSD. Take $\cal X = \{0,1\}$ and check the resulting kernel matrix.