Thinking in terms of directional derivatives might be more instructive in this case, as we can arrive at the chain rule formulation in a constructive fashion.
Let us consider the directional derivative of a multivariate function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ in an arbitrary direction $\textbf{u} \in \mathbb{R}^n$.
Since $\textbf{u}$ is a direction, we shall assume $||\textbf{u}|| = 1$. The directional derivative in the direction of $\textbf{u}$ is then defined as
$$\nabla_u f = \nabla f \cdot \textbf{u}$$
This can easily be proved by noting that, under the usual limit definition of a derivative, we have
$$\nabla_u f (\textbf{x}) = \lim_{h\rightarrow0} \frac{f(\textbf{x} + h\textbf{u}) - f(\textbf{x})}{h }$$
Since we assume differentiability of the objective function $f$, we can find a linear approximant of $f$ around any point $\textbf{a}$ that is close to the true value of $f(\textbf{x})$ in an $\epsilon$-neighborhood of $\textbf{a}$:
$$f(\textbf{x}) \approx f(\textbf{a}) + \nabla f(\textbf{a})^T (\textbf{x} - \textbf{a})$$
Plugging this approximant into our previous limit formulation, with $\textbf{x} = \textbf{a} + h\textbf{u}$, the $f(\textbf{a})$ terms cancel and the factor $h$ divides out, giving, for any $\textbf{a}$,
$$\nabla_u f (\textbf{a}) = \nabla f(\textbf{a}) \cdot \textbf{u}$$
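The identity $\nabla_u f = \nabla f \cdot \textbf{u}$ can be checked numerically. A minimal sketch, using a hypothetical test function $f(x_0, x_1) = x_0^2 + 3x_0 x_1$ chosen only for illustration, compares the finite-difference version of the limit definition against the dot product with the analytic gradient:

```python
import numpy as np

# Hypothetical test function f(x) = x0^2 + 3*x0*x1 and its analytic gradient.
def f(x):
    return x[0]**2 + 3*x[0]*x[1]

def grad_f(x):
    return np.array([2*x[0] + 3*x[1], 3*x[0]])

a = np.array([1.0, 2.0])
u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)          # a direction, so ||u|| = 1

h = 1e-6
numeric = (f(a + h*u) - f(a)) / h  # limit definition with a small finite h
analytic = grad_f(a) @ u           # nabla f(a) . u

print(abs(numeric - analytic) < 1e-4)  # True
```

The two values agree up to the $O(h)$ truncation error of the one-sided difference.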
Going back to the original problem, when differentiating the cost function $E_n$ by an arbitrary parameter $a_k$ we also need to take into account the perturbations induced by a change in $a_k$ in any other parameters it interacts with. Specifically, when altering $a_k$ we shall also move along the $\textbf{direction}$ of the perturbed parameters.
This direction essentially encapsulates all the changes caused in intermediate variables $\{a_k\}_{k=1}^{n}$ that appear when altering a certain $a_j$. As such, for every $a_k$ that is directly influenced by $a_j$, we can measure the actual change by evaluating $\frac{\partial a_k}{\partial a_j}$.
Let $\textbf{p}$ be the vector of these perturbations; its components are
$$\textbf{p}_k = \frac{\partial a_k}{\partial a_j}$$
where $k$ runs through all parameters $a_k$ directly influenced by $a_j$.
Coupling the directional derivative with the previously described concept, we arrive at the desired result, namely
$$\frac{\partial E_n}{\partial a_j} = \nabla E_n ^T \cdot \textbf p = \sum_{k \in \mathcal{S}} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$$
where $\mathcal{S}$ is the set of all indices corresponding to variables directly influenced by $a_j$.
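The sum over $\mathcal{S}$ can be verified on a toy computational graph. In this sketch (the graph itself is an assumption for illustration), $a_j$ directly influences two intermediate variables, $a_1 = a_j^2$ and $a_2 = \sin a_j$, and the cost is $E = a_1 a_2$:

```python
import numpy as np

# Toy graph (an illustrative assumption): a_j feeds two intermediate
# variables a_1 = a_j**2 and a_2 = sin(a_j), and the cost is E = a_1 * a_2.
aj = 0.7
a1, a2 = aj**2, np.sin(aj)

dE_da = np.array([a2, a1])            # [dE/da_1, dE/da_2]
p = np.array([2*aj, np.cos(aj)])      # [da_1/da_j, da_2/da_j]

chain = dE_da @ p                     # sum over k in S of dE/da_k * da_k/da_j

# Numeric check against the composed function E(a_j) = a_j**2 * sin(a_j)
h = 1e-6
E = lambda x: x**2 * np.sin(x)
numeric = (E(aj + h) - E(aj - h)) / (2*h)

print(abs(chain - numeric) < 1e-8)  # True
```

The chain-rule sum matches the central-difference derivative of the composed function, which is exactly what the formula above asserts.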
After some hours of research I've found a few sites which, taken together, answer these questions.
Regarding items 1 and 2, it looks like there is indeed a severe abuse of notation every time the author refers to the function $h$. This function seems to be the so-called self-information, and it is usually defined over probability events or random variables. I find this article very clarifying in this respect.
Regarding item 4, from what I have seen, it seems that under certain conditions that a self-information function must satisfy, the logarithm is the only possible choice. The selected answer in this post was particularly useful, as were the comments on the question. This topic is also discussed here, but I prefer the previous link.
Finally, I have not found an answer for item 3. I actually think that this step is wrongly formulated due to the imprecision in the definition of the function $h$. Nevertheless, the links I have provided as an answer to item 4 lead to the desired result.
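The key property behind item 4 is additivity over independent events: the information of two independent events should be the sum of their individual informations, and the logarithm is (up to scale) the only continuous function turning products of probabilities into sums. A minimal sketch, assuming the usual definition $h(p) = -\log_2 p$:

```python
import math

# A minimal sketch, assuming the standard definition of self-information
# h(p) = -log2(p) for an event of probability p.
h = lambda p: -math.log2(p)

p, q = 0.25, 0.5
# Additivity over independent events -- the property that forces a logarithm:
print(math.isclose(h(p*q), h(p) + h(q)))  # True
print(h(0.25))                            # 2.0 bits: a 1-in-4 event
```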
Best Answer
You are not making a mistake, you just need to go one step further. First, note that $\mathbf{m}_{N}=\beta \mathbf{A}^{-1} \boldsymbol{\Phi}^{T} \mathbf{t}$ with $\mathbf{A} = \alpha I + \beta \boldsymbol{\Phi}^{T}\boldsymbol{\Phi}$. Having that in mind, we can start by working your expression out:
$$ \frac{\partial E\left(\mathbf{m}_{N}\right)}{\partial \alpha}=\left\{\beta \boldsymbol{\Phi}^{T}\left(\boldsymbol{\Phi} \mathbf{m}_{N}-\mathbf{t}\right)+\alpha \mathbf{m}_{N}\right\}^{T} \frac{\partial \mathbf{m}_{N}}{\partial \alpha}+\frac{1}{2} \mathbf{m}_{N}^{T} \mathbf{m}_{N} $$
Now, if we take a closer look, we can find that:
$$ \left\{\beta \boldsymbol{\Phi}^{T}\left(\boldsymbol{\Phi} \mathbf{m}_{N}-\mathbf{t}\right)+\alpha \mathbf{m}_{N}\right\}^{T} \frac{\partial \mathbf{m}_{N}}{\partial \alpha} = \left\{ {\beta \boldsymbol{\Phi}^{T}\boldsymbol{\Phi}\mathbf{m}_{N} + \alpha \mathbf{m}_{N} - \beta \boldsymbol{\Phi}^{T}\mathbf{t}} \right\}^{T}\frac{\partial \mathbf{m}_{N}}{\partial \alpha} $$
which is the same as $\left\{ {\mathbf{A}\mathbf{m}_{N} - \beta \boldsymbol{\Phi}^{T}\mathbf{t}} \right\}^{T}\frac{\partial \mathbf{m}_{N}}{\partial \alpha} = \left\{ \beta \mathbf{A}\mathbf{A}^{-1} \boldsymbol{\Phi}^{T} \mathbf{t} - \beta \boldsymbol{\Phi}^{T}\mathbf{t}\right\}^{T} \frac{\partial \mathbf{m}_{N}}{\partial \alpha}= 0$. This means that $\frac{\partial E\left(\mathbf{m}_{N}\right)}{\partial \alpha}=\frac{1}{2} \mathbf{m}_{N}^{T} \mathbf{m}_{N}$
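This result can also be checked numerically. A minimal sketch, with an arbitrary random design matrix $\boldsymbol{\Phi}$, target vector $\mathbf{t}$, and values of $\alpha$, $\beta$ (all assumptions for illustration), compares a finite-difference derivative of $E(\mathbf{m}_N(\alpha))$ with the closed form $\frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N$:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 5))   # arbitrary design matrix (assumption)
t = rng.normal(size=20)
alpha, beta = 0.5, 2.0

def E_at_mN(alpha):
    """E(m_N) = beta/2 ||t - Phi m_N||^2 + alpha/2 m_N^T m_N,
    with m_N = beta A^{-1} Phi^T t and A = alpha I + beta Phi^T Phi."""
    A = alpha*np.eye(5) + beta*Phi.T @ Phi
    mN = beta * np.linalg.solve(A, Phi.T @ t)
    return beta/2*np.sum((t - Phi@mN)**2) + alpha/2*mN@mN, mN

# Central finite difference of E(m_N(alpha)) with respect to alpha
h = 1e-6
E_plus, _ = E_at_mN(alpha + h)
E_minus, _ = E_at_mN(alpha - h)
numeric = (E_plus - E_minus) / (2*h)

_, mN = E_at_mN(alpha)
closed_form = 0.5 * mN @ mN      # the result derived above

print(abs(numeric - closed_form) < 1e-5)  # True
```

The implicit term vanishes precisely because $\mathbf{m}_N$ minimizes $E$, so only the explicit $\frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$ term contributes to the derivative.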