Define the radial vectors $\,w_k = (x - x_k)\in{\mathbb R}^n\,$ and note that their differentials are all identical: $\,dw_k = dx$.
Concatenate the vectors into a single matrix
$$\eqalign{
\Omega&=\big[\,w_1\;w_2\;\ldots\;w_m\big]&\in&{\mathbb R}^{n\times m} \\
d\Omega&=\big[\,dx\;dx\;\ldots\;dx\big] &= &dx\,{\tt\large 1}^T
}$$
Note that the $\,r_j=\|w_j\|\,$ are themselves elements of a vector
$\,r\in{\mathbb R}^m$.
The Cartesian basis vectors $\,e_k\in{\mathbb R}^m$ allow us to write all of the above as
$$w_k=\Omega\,e_k,\quad dx=d\Omega\,e_k,\quad r_j=e_j^Tr$$
Applying the RBF $\,\phi\,$ element-wise avoids summations and subscripts. Simply write the function values, derivatives, and differentials as
$$\eqalign{
&g=\phi(r),\quad g'=\phi'(r),\quad g''=\phi''(r)\;&\in{\mathbb R}^m \\
&dg=g'\odot dr,\quad dg'=g''\odot dr \;&\in{\mathbb R}^m \\
}$$
where $\odot$ is the elementwise/Hadamard product.
It will also be useful to switch between vectors and diagonal matrices, which will be indicated by an uppercase letter, e.g.
$$\eqalign{
&R={\rm Diag}(r),\quad &G={\rm Diag}(g),\quad &G''={\rm Diag}(g'')\;\in{\mathbb R}^{m\times m} \\
&r = {\rm diag}(R),\quad &g = {\rm diag}(G),\quad &g''=\ldots \\
&r = R{\tt\large 1},\quad &g = G{\tt\large 1},\quad &g''=\ldots \\
&dg = G'dr,\quad &dg' = G''dr \\
}$$
and it will prove convenient to define
$$
P=R^{-1}\quad\implies PR=I,\;\;p\odot r = {\tt\large 1}
$$
Now write down the main relationship and differentiate it.
$$\eqalign{
r\odot r &= {\rm diag}(\Omega^T\Omega) \\
2r\odot dr &= {\rm diag}(\Omega^Td\Omega+d\Omega^T\Omega)
\;=\; 2\,{\rm diag}(\Omega^Td\Omega) \\
R\,dr &= {\rm diag}(\Omega^Tdx\,{\tt\large 1}^T) \;=\; \Omega^Tdx \\
dr &= P\Omega^Tdx \\
\frac{\partial r}{\partial x} &= P\Omega^T \\
}$$
A quick check of the $i^{th}$ component recovers Eqn (18) from the paper and confirms that we're on the right path:
$$\eqalign{
e_i^T\bigg(&\frac{\partial r}{\partial x}\bigg) = e_i^TP\Omega^T \\
&\frac{\partial r_i}{\partial x}
\;=\; \frac{1}{r_i}\;e_i^T\Omega^T
\;=\; \frac{w_i^T}{\|w_i\|} \\
}$$
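The formula $\,\frac{\partial r}{\partial x} = P\Omega^T\,$ can also be sanity-checked numerically. Here is a minimal finite-difference sketch; the sizes $(n,m)$ and the random centers/point are illustrative assumptions, not anything from the paper.

```python
import numpy as np

# Numerical sanity check of dr/dx = P Omega^T.
# Sizes and random data below are illustrative assumptions.
rng = np.random.default_rng(0)
n, m = 3, 4
centers = rng.normal(size=(n, m))    # columns are the x_k
x = rng.normal(size=n)

def r_of(x):
    W = x[:, None] - centers         # Omega = [w_1 ... w_m]
    return np.linalg.norm(W, axis=0) # r_j = ||w_j||

Omega = x[:, None] - centers
P = np.diag(1.0 / r_of(x))           # P = R^{-1}
analytic = P @ Omega.T               # dr/dx, shape (m, n)

# Forward-difference approximation of the same Jacobian
eps = 1e-6
numeric = np.stack([(r_of(x + eps*np.eye(n)[i]) - r_of(x)) / eps
                    for i in range(n)], axis=1)
print(np.max(np.abs(analytic - numeric)))
```

The printed discrepancy should be on the order of the step size, confirming the closed form.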
The model function (replacing $\beta$ with the easier-to-type $b$) is
$$f = b^Tg = b:g$$
where the colon denotes the Frobenius product, i.e.
$\;A:B = {\rm Tr}(A^TB)$.
Calculate the differential and gradient of the model.
$$\eqalign{
df &= b:dg \;=\; b:\big(g'\odot dr\big) \;=\; \big(g'\odot b\big):dr \\
&= (g'\odot b):P\,\Omega^Tdx \\
&= \Omega P(g'\odot b):dx \\
&= \Omega PG'B{\tt\large 1}:dx \\
J=\frac{\partial f}{\partial x}
&= \Omega PG'B{\tt\large 1}
\;=\; \bigg(\frac{\partial r}{\partial x}\bigg)^T\Big(b\odot g'\Big) \\
}$$
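Before moving on, the Jacobian formula $\,J=\big(\frac{\partial r}{\partial x}\big)^T(b\odot g')\,$ can be verified against finite differences. The Gaussian kernel $\phi(r)=e^{-r^2}$, the sizes, and the random data below are illustrative assumptions (the derivation holds for any smooth $\phi$).

```python
import numpy as np

# Check J = (dr/dx)^T (b ⊙ g') against a finite-difference gradient.
# Kernel phi(r) = exp(-r^2), sizes, and data are illustrative assumptions.
rng = np.random.default_rng(1)
n, m = 3, 5
centers = rng.normal(size=(n, m))    # columns are the x_k
b = rng.normal(size=m)
x = rng.normal(size=n)

phi  = lambda r: np.exp(-r**2)
dphi = lambda r: -2*r*np.exp(-r**2)  # phi'

def f(x):
    r = np.linalg.norm(x[:, None] - centers, axis=0)
    return b @ phi(r)                # f = b^T g

Omega = x[:, None] - centers
r = np.linalg.norm(Omega, axis=0)
drdx = np.diag(1/r) @ Omega.T        # P Omega^T
J = drdx.T @ (b * dphi(r))           # gradient of f, shape (n,)

eps = 1e-6
numJ = np.array([(f(x + eps*np.eye(n)[i]) - f(x)) / eps for i in range(n)])
print(np.max(np.abs(J - numJ)))
```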
All that hard work produced the Jacobian (gradient). Now on to the Hessian.
$$\eqalign{
dJ
&= d\Omega\,PG'B{\tt\large 1} + \Omega PdG'B{\tt\large 1} + \Omega\,dP\,G'B{\tt\large 1}
\\
&= dx\,{\tt\large 1}^TPG'B{\tt\large 1} + \Omega PB\,dg' - \Omega (P\,dR\,P)G'B{\tt\large 1}
\\
&= dx\,({\tt\large 1}^TPG'B{\tt\large 1}) +\Omega PB\,dg' -\Omega PG'PB\,dr
\\
&= (G':PB)\,dx +\Omega PBG''\,dr -\Omega PG'PB\,dr
\\
&= \Big((G':PB)I +\Omega PB(G'' - PG')P\Omega^T\Big)\,dx
\\
H = \frac{\partial J}{\partial x}
&= (G':PB)I + \Omega PB(G''-PG')P\Omega^T \\
&= \Big((p\odot b):g'\Big)\,I \;+\; \bigg(\frac{\partial r}{\partial x}\bigg)^T\Big(BG''-BPG'\Big)\bigg(\frac{\partial r}{\partial x}\bigg) \\
}$$
Although it may not look like it, this result is identical to Eqn (21). The form above has the advantage of being manifestly symmetric (as it must be), whereas the result in the paper is awkward.
When manipulating the above expressions, keep in mind that the $(R,G,B)$ matrices are all diagonal and therefore commute with one another; however, $\,\Omega\,$ is a full matrix and does not commute with the others.
The clumsy notation $H_j$ does not refer to a vector or matrix element, but rather to the fact that $H(x)$ was evaluated at the point $x=x_j$. All of the variables above, except for $(b,1,e_k,B,I)$, are implicitly functions of $x$. So you could slap a $j$ subscript on each of them to calculate the expression for $H_j$.
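The Hessian formula $\,H = \big((p\odot b):g'\big)I + \big(\frac{\partial r}{\partial x}\big)^T B(G''-PG')\big(\frac{\partial r}{\partial x}\big)\,$ can likewise be checked by differencing the gradient. Again, the Gaussian kernel and the random data are illustrative assumptions.

```python
import numpy as np

# Finite-difference check of
#   H = ((p ⊙ b):g') I + (dr/dx)^T B (G'' - P G') (dr/dx).
# Kernel phi(r) = exp(-r^2), sizes, and data are illustrative assumptions.
rng = np.random.default_rng(2)
n, m = 3, 5
centers = rng.normal(size=(n, m))    # columns are the x_k
b = rng.normal(size=m)
x = rng.normal(size=n)

phi1 = lambda r: -2*r*np.exp(-r**2)            # phi'
phi2 = lambda r: (4*r**2 - 2)*np.exp(-r**2)    # phi''

def grad_f(x):
    W = x[:, None] - centers                   # Omega
    r = np.linalg.norm(W, axis=0)
    drdx = np.diag(1/r) @ W.T                  # P Omega^T
    return drdx.T @ (b * phi1(r))              # J = (dr/dx)^T (b ⊙ g')

W = x[:, None] - centers
r = np.linalg.norm(W, axis=0)
p, g1, g2 = 1/r, phi1(r), phi2(r)
drdx = np.diag(p) @ W.T
H = np.sum(p*b*g1)*np.eye(n) + drdx.T @ np.diag(b*(g2 - p*g1)) @ drdx

eps = 1e-6
numH = np.stack([(grad_f(x + eps*np.eye(n)[i]) - grad_f(x)) / eps
                 for i in range(n)], axis=1)
print(np.max(np.abs(H - numH)))
```

Note that `H` is symmetric by construction, exactly as the sandwiched-diagonal form predicts.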
Best Answer
$ \def\BR#1{\Big(#1\Big)} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\diag#1{\op{diag}\LR{#1}} \def\Diag#1{\op{Diag}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\a{\alpha}\def\b{\beta} \def\l{\lambda}\def\s{\sigma} \def\o{{\tt1}}\def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\hess#1#2{\frac{\p^2 #1}{\p #2^2}} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\fracLR#1#2{\LR{\frac{#1}{#2}}} \def\A{A^{-1}} \def\S{S^{-1}} $
The Frobenius product $(:)$ is extremely useful in Matrix Calculus
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \|A\|^2_F \\
}$$
This is also called the double-dot or double-contraction product.
When applied to vectors $(n=\o)$ it reduces to the standard dot product.
The properties of the underlying trace function allow the terms in a Frobenius product to be rearranged in many fruitful ways, e.g.
$$\eqalign{
A:B &= B:A \\
A:B &= A^T:B^T \\
C:\LR{AB} &= \LR{CB^T}:A \;=\; \LR{A^TC}:B \\
}$$
It also commutes with the elementwise/Hadamard product $(\odot)$
$$A:\LR{B\odot C} = \LR{A\odot B}:C$$
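All of these identities can be spot-checked numerically. A minimal sketch, where `frob` is a small helper defined here and the matrix sizes are arbitrary:

```python
import numpy as np

# Spot-check the Frobenius-product identities with random matrices.
rng = np.random.default_rng(3)
frob = lambda A, B: np.trace(A.T @ B)               # A:B = Tr(A^T B)

A, B, C = [rng.normal(size=(4, 4)) for _ in range(3)]
assert np.isclose(frob(A, B), np.sum(A * B))        # double contraction
assert np.isclose(frob(A, B), frob(B, A))           # A:B = B:A
assert np.isclose(frob(A, B), frob(A.T, B.T))       # A:B = A^T:B^T
assert np.isclose(frob(C, A @ B), frob(C @ B.T, A)) # C:(AB) = (CB^T):A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B)) # C:(AB) = (A^T C):B
assert np.isclose(frob(A, B * C), frob(A * B, C))   # A:(B⊙C) = (A⊙B):C
print("all identities verified")
```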
For typing convenience, define the variables (all functions are element-wise)
$$\eqalign{
X &= \big[x_1\;\;x_2\;\cdots\;x_n\big] \\
a &= \a\o,\quad b=\b,\quad w=\frac{\o}{a} \\
z &= X^Tb + \log(a) &\qiq dz = X^Tdb \\
e &= \exp(z) = \a\exp(X^Tb) &\qiq de = e\odot dz \\
s &= \s(z) = \frac{e}{\o+e} &\qiq \big({\rm Logistic\;function}\big) \\
S &= \Diag s &\qiq ds = \LR{S-S^2} dz \\
Y &= \Diag y,\quad W\!=\!\Diag w \\
}$$
The derivative of the logistic function shown above is well known.
Only the first two terms of the log-likelihood expression contain $\b$, so create a truncated function and calculate its differential and gradient.
$$\eqalign{
\l &= y:\log(s) - w:\log(\o+e) \\
d\l &= y:\fracLR{ds}{s} - w:\fracLR{de}{\o+e} \\
&= y:\LR{\S ds} - w:\fracLR{e\odot dz}{\o+e} \\
&= y:\LR{I-S}\c{dz} - w:S\:\c{dz} \\
&= \BR{Iy-Sy-Sw}:\c{X^Tdb} \\
&= X\BR{y-Sy-Sw}:db \\
&= X\BR{y-Ys-Ws}:db \\
g \;=\; \grad{\l}{b} &= X\BR{y-\LR{Y+W}s} \\
}$$
Next, calculate the gradient of $g$, i.e. the Hessian.
$$\eqalign{
dg &= -X(Y+W)\:ds \\
&= -X(Y+W)\LR{S-S^2}dz \\
&= -X(Y+W)\LR{S-S^2}X^Tdb \\
H = \hess{\l}{b} \;=\; \grad{g}{b} &= -\c{X}(Y+W)\LR{S-S^2}\c{X^T} \\
}$$
Notice that the terms sandwiched between $X$ and $X^T$ are diagonal matrices; therefore $H$ is symmetric (as it should be).
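Both formulas can be confirmed with finite differences. In this sketch the sizes, the value of $\alpha$, and the random data/labels are illustrative assumptions:

```python
import numpy as np

# Finite-difference check of g = X(y - (Y+W)s) and H = -X(Y+W)(S-S^2)X^T.
# Sizes, alpha, and the random data are illustrative assumptions.
rng = np.random.default_rng(4)
n, m = 3, 6                            # n parameters, m samples
X = rng.normal(size=(n, m))
y = rng.integers(0, 2, size=m).astype(float)
alpha = 0.7
a = alpha * np.ones(m)                 # a = alpha * 1
w = 1.0 / a                            # w = 1/a

def lam(b):
    z = X.T @ b + np.log(a)
    s = 1.0 / (1.0 + np.exp(-z))       # logistic sigma(z) = e/(1+e)
    return y @ np.log(s) - w @ np.log1p(np.exp(z))

def grad(b):
    z = X.T @ b + np.log(a)
    s = 1.0 / (1.0 + np.exp(-z))
    return X @ (y - (y + w) * s)       # g = X(y - (Y+W)s)

b = rng.normal(size=n)
z = X.T @ b + np.log(a)
s = 1.0 / (1.0 + np.exp(-z))
g = X @ (y - (y + w) * s)
H = -X @ np.diag((y + w) * (s - s**2)) @ X.T

eps = 1e-6
numg = np.array([(lam(b + eps*np.eye(n)[i]) - lam(b)) / eps for i in range(n)])
numH = np.stack([(grad(b + eps*np.eye(n)[i]) - grad(b)) / eps
                 for i in range(n)], axis=1)
print(np.max(np.abs(g - numg)), np.max(np.abs(H - numH)))
```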
Some people are overly pedantic and write $(\p b^2)$ as $(\p b\:\p b^T)$ which properly conveys the shape of the Hessian matrix.
As for book recommendations, the standard text is probably Magnus and Neudecker's Matrix Differential Calculus, although personally I prefer Hjorungnes's Complex-Valued Matrix Derivatives.
For simply looking up formulas, if you cannot find it online in Petersen and Pedersen's Matrix Cookbook then consult Bernstein's Matrix Mathematics: Theory, Facts, and Formulas.