The main things going on in this question are multivariate calculus and linear algebra. With a solid understanding of those, most of the math behind machine learning techniques falls into place, give or take the statistical view that some people have on the subject.
I agree the indices aren't great, but the problem is solvable with careful bookkeeping. Take every step slowly, and take small steps. What the problem is asking you to do is differentiate the error function with respect to the weights; the set of equations it asks for is precisely what you find when you set that gradient equal to $0$.
To make that explicit, consider just the partial with respect to $w_1$: $$\begin{aligned}\frac{\partial E}{\partial w_1}&=\frac{\partial}{\partial w_1}\left[\frac12\sum_{n=1}^N(y(x_n,w)-t_n)^2\right]\\&=\frac12\sum_{n=1}^N\frac{\partial}{\partial w_1}(y(x_n,w)-t_n)^2\\&=\frac12\sum_{n=1}^N2(y(x_n,w)-t_n)\frac{\partial y}{\partial w_1}(x_n, w)\\&=\frac12\sum_{n=1}^N2(y(x_n,w)-t_n)x_n^1\\&=\sum_{n=1}^N(y(x_n,w)-t_n)x_n^1\\&=\sum_{n=1}^N\left(\sum_{j=0}^Mw_jx_n^j-t_n\right)x_n^1\end{aligned}$$
With a little cajoling, setting that partial equal to $0$ gives the second coordinate of the system of linear equations the textbook asks for. What happens when you repeat the exercise for an arbitrary $w_a$ instead of just $w_1$?
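(If you want to check your answer: repeating the computation for a general index $a$ and setting each partial to zero yields $$\sum_{j=0}^M\left(\sum_{n=1}^N x_n^{a+j}\right)w_j=\sum_{n=1}^N t_n\,x_n^a,\qquad a=0,1,\dots,M,$$ which matches the system the textbook asks for, in the notation $A_{aj}=\sum_n x_n^{a+j}$ and $T_a=\sum_n t_n x_n^a$.)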
Just because this point is rarely brought up, I want to point out that indices are indeed a large part of the confusion in a problem like this. Especially once you want to simplify things or speed up calculations, all that bookkeeping becomes a nightmare. The same set of equations pops out, though, if we take derivatives with respect to vectors or matrices instead. Define $$X=\begin{pmatrix}x_1^0&\cdots&x_1^M\\\vdots&\ddots&\vdots\\x_N^0&\cdots&x_N^M\end{pmatrix},\qquad t=\begin{pmatrix}t_1\\\vdots\\t_N\end{pmatrix},\qquad w=\begin{pmatrix}w_0\\\vdots\\w_M\end{pmatrix},$$ so that the $n$th row of $X$ holds the powers of $x_n$. Note that your error function can be succinctly expressed as $$E(w)=\frac12(Xw-t)^T(Xw-t).$$
To take the derivative, we can proceed almost as if we were doing standard one-variable calculus, and skipping to the solution, we find $$\frac{\partial E}{\partial w}=X^TXw-X^Tt.$$
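(For completeness, the skipped computation: expand $$E(w)=\frac12\left(w^TX^TXw-2t^TXw+t^Tt\right),$$ then apply the standard identities $\frac{\partial}{\partial w}\left(w^TAw\right)=2Aw$ for symmetric $A$ and $\frac{\partial}{\partial w}\left(b^Tw\right)=b$ with $b=X^Tt$.)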
Setting this equal to $0$, we recover the equations the author asks for:
$$X^TXw=X^Tt.$$ Better yet, in the event that $X$ is invertible (which can unfortunately only happen when $X$ is square, i.e. $N=M+1$; in that case $X$ is a Vandermonde matrix, invertible precisely when the $x_n$ are distinct), this problem simplifies even further to $$Xw=t,$$ a simplification we were almost certain to miss when wading through indices.
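If you want to sanity-check the matrix version numerically, here is a minimal NumPy sketch; the data, degree, and variable names are all made up for illustration:

```python
import numpy as np

# Hypothetical data: N = 10 noisy samples of a sine curve (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(10)

# Design matrix for a degree-M polynomial: row n is (x_n^0, x_n^1, ..., x_n^M).
M = 3
X = np.vander(x, M + 1, increasing=True)

# Solve the normal equations X^T X w = X^T t.
w = np.linalg.solve(X.T @ X, X.T @ t)

# The gradient X^T X w - X^T t should vanish at the solution.
print(np.allclose(X.T @ X @ w, X.T @ t))  # True
```

Solving the normal equations directly is fine for a small example like this; for larger or ill-conditioned problems, `np.linalg.lstsq(X, t)` is the more numerically stable route.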
After some hours of research I've found a few sites which, taken together, answer these questions.
Regarding items 1 and 2, it looks like there is indeed a severe abuse of notation every time the author refers to the function $h$. This function seems to be the so-called self-information, which is usually defined on probability events or random variables. I find this article very clarifying in this respect.
Regarding item 4, from what I have seen, it seems that under certain conditions that the self-information function must satisfy, the logarithm is the only possible choice. The selected answer in this post was particularly useful, as were the comments on the question. This topic is also discussed here, but I prefer the previous link.
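To sketch the standard argument (under the usual assumptions that $h$ depends only on the probability $p$ of the event, is continuous and decreasing, and is additive over independent events): additivity says $h(pq)=h(p)+h(q)$, so writing $g(u)=h(e^{-u})$ turns this into Cauchy's functional equation $g(u+v)=g(u)+g(v)$, whose continuous solutions are $g(u)=ku$. Hence $$h(p)=-k\log p$$ for some constant $k>0$; the base of the logarithm just fixes the units.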
Finally, I have not found an answer to item 3. In fact, I think that step is wrongly formulated, owing to the imprecision in the definition of the function $h$. Nevertheless, the links I provided for item 4 lead to the desired result.
Best Answer
I think you are there. Maybe slightly rewriting your last equation helps:
$$ A_{i1} w_1 + A_{i2} w_2 + \cdots + (A_{ii} + \lambda) w_i + \cdots + A_{iM} w_M = T_i $$
and you have $M$ of these equations.
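Equivalently, collecting these equations in matrix form gives $$(A+\lambda I)\,w=T,$$ which makes clear that the regularizer just shifts the diagonal of $A$ by $\lambda$.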