It seems to me that before much more progress can be made in the calculus of ${}^xy$, more fundamental questions have to be answered, such as, how to define ${}^xy$ for rational $x$? It's clear how the OP's definition works if $x$ is a non-negative integer; but how do we define ${}^xy$ if, say, $x = 7/2$? What then is "one-half" of an occurrence of $x$ in the exponential "tower" which is supposed to be ${}^xy$?
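To make the integer case concrete, here is a minimal Python sketch. It assumes the OP's definition is the usual power tower of $x$ copies of $y$, evaluated from the top down, with ${}^0y = 1$ as the empty tower; the function name `tetration` and that base-case convention are my own assumptions, not taken from the question.

```python
def tetration(y, n):
    """Power tower of n copies of y, for a non-negative integer height n.

    Assumed convention: the empty tower (n == 0) equals 1.
    """
    if not isinstance(n, int) or n < 0:
        # This is exactly the gap discussed above: no rule for heights like 7/2.
        raise ValueError("height must be a non-negative integer")
    result = 1
    for _ in range(n):
        # Build the tower from the top down: y ** (tower of height k).
        result = y ** result
    return result


print(tetration(2, 3))  # 2 ** (2 ** 2)
```

The loop runs a whole number of times, so there is nothing for it to do with a height of $7/2$; whatever extends the definition has to come from somewhere other than the recursion itself.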
I am reminded here of the way $x^y$ is extended from integers through the reals, by starting with a careful, consistent and believable definition of $(p / q)^{(r / s)}$ for integral $p, q, r, s$; once we have that, a simple, consistent and believable continuity argument allows us to accept a definition of $x^y$ for real $x, y > 0$. We know what $(p / q)^r = (p^r / q^r)$ means; we know what it means for a positive real $z$ to satisfy $z^s = (p / q)^r$, so we can get a handle on $(p / q)^{(r / s)}$ from which, by continuity, we can generalize to $x^y$. I think an analogous method is needed here, but I don't know what it is. But I think my question of the preceding paragraph might be worth considering early on in this game.
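The rational-exponent step can be sketched numerically. The following is only an illustration under my own assumptions (the name `rational_power`, the use of bisection, and the iteration count are all mine): it computes $(p/q)^r$ exactly, then looks for the positive real $z$ with $z^s = (p/q)^r$.

```python
from fractions import Fraction


def rational_power(base, exponent, iterations=200):
    """Approximate base ** exponent for a positive rational base and a
    rational exponent r/s by solving z ** s == base ** r for z > 0."""
    base = Fraction(base)
    exponent = Fraction(exponent)
    if base <= 0:
        raise ValueError("base must be positive")
    r, s = exponent.numerator, exponent.denominator  # Fraction keeps s > 0
    target = float(base ** r)  # (p/q)^r is an exact rational before rounding

    # z -> z ** s is increasing for z > 0, so the root lies in [0, max(1, target)].
    lo, hi = 0.0, max(1.0, target)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid ** s < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(rational_power(Fraction(9, 4), Fraction(1, 2)))  # close to 3/2
```

Each step here uses only integer powers and the monotonicity of $z \mapsto z^s$, which is what makes the definition believable. I don't see the analogous monotone equation for ${}^{(r/s)}y$.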
Of course, perhaps there is a (reasonably) simple, consistent and believable argument to construct ${}^xy$ using $\exp()$, $\log()$, etc., or some sort of differential or similar equation ${}^xy$ must satisfy, or perhaps one could learn something from the $\Gamma$ function and factorials here which would bypass, at least temporarily, the need to address how ${}^{(p / q)}(r / s)$ is supposed to work, but sooner or later the question will have to be faced, I'll warrant.
This is an interesting, though speculative, arena and I am glad to have participated. But until I can answer my own questions to my better satisfaction, I will refrain from further remarks, except to bid those who are ready to climb such unknown heights, "Excelsior!"
Hope this helps, at least with the spirit of the adventure if not with the direction. Happy New Year,
and as always,
Fiat Lux!!!
Your derivatives $\large \frac{\partial p_j}{\partial o_i}$ are indeed correct; however, there is an error when you differentiate the loss function $L$ with respect to $o_i$.
We have the following (where I have highlighted in $\color{red}{red}$ where you have gone wrong)
$$\frac{\partial L}{\partial o_i}=-\sum_ky_k\frac{\partial \log p_k}{\partial o_i}=-\sum_ky_k\frac{1}{p_k}\frac{\partial p_k}{\partial o_i}\\=-y_i(1-p_i)-\sum_{k\neq i}y_k\frac{1}{p_k}({\color{red}{-p_kp_i}})\\=-y_i(1-p_i)+\sum_{k\neq i}y_k({\color{red}{p_i}})\\=-y_i+\color{blue}{y_ip_i+\sum_{k\neq i}y_k({p_i})}\\=\color{blue}{p_i\left(\sum_ky_k\right)}-y_i=p_i-y_i$$ given that $\sum_ky_k=1$ from the slides (as $y$ is a vector with only one non-zero element, which is $1$).
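If you want to sanity-check the result $\frac{\partial L}{\partial o_i}=p_i-y_i$ numerically, a central finite difference does the job. This is a minimal sketch in plain Python; the function names and the sample values of $o$ and $y$ are made up for illustration and are not from the slides.

```python
import math


def softmax(o):
    """Softmax of a list of scores, shifted by the max for numerical stability."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(o, y):
    """L = -sum_k y_k log p_k with p = softmax(o)."""
    p = softmax(o)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))


def numerical_grad(o, y, eps=1e-6):
    """Central-difference estimate of dL/do_i for each i."""
    grad = []
    for i in range(len(o)):
        up = list(o)
        up[i] += eps
        down = list(o)
        down[i] -= eps
        grad.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * eps))
    return grad


o = [0.5, -1.2, 2.0]  # arbitrary example scores
y = [0.0, 1.0, 0.0]   # one-hot target, so sum_k y_k = 1

analytic = [pi - yi for pi, yi in zip(softmax(o), y)]  # the claimed p_i - y_i
numeric = numerical_grad(o, y)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_gap)  # should be tiny if the formula is right
```

Note that the check relies on $\sum_k y_k = 1$, just as the derivation does; with a target that does not sum to one, the gradient is $p_i\sum_k y_k - y_i$ instead.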
What you are doing wrong is assuming that you can apply the "product rule" and "chain rule" to matrix differentiation as you're thinking about it, as is stated in the article here.
There is a "product rule" and "chain rule" that work in this context. However, understanding them requires that you acknowledge that the derivative of $h(a)$ is not simply a scalar-valued function on matrices; rather, at each $a$, $h'(a)$ is a linear functional on matrices, which can be represented nicely as a matrix with the correct choice of dual basis.
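To illustrate what "a linear functional represented by a matrix" means, here is a small sketch with a stand-in function. I am assuming, purely for illustration, $h(A)=\operatorname{tr}(A^TA)$, which is not the $h$ from your question; for it, $h'(A)$ sends a direction $H$ to $2\operatorname{tr}(A^TH)$, so the representing matrix is $2A$.

```python
def h(a):
    """Hypothetical example function: h(A) = tr(A^T A), the sum of squared entries."""
    return sum(x * x for row in a for x in row)


def dh(a, direction):
    """h'(A) applied to a direction H: 2 * tr(A^T H) = 2 * sum_ij A_ij H_ij."""
    return 2 * sum(
        x * d
        for row_a, row_d in zip(a, direction)
        for x, d in zip(row_a, row_d)
    )


def directional_difference(a, direction, eps=1e-6):
    """Central-difference estimate of the derivative of h at A along H."""
    plus = [[x + eps * d for x, d in zip(ra, rd)] for ra, rd in zip(a, direction)]
    minus = [[x - eps * d for x, d in zip(ra, rd)] for ra, rd in zip(a, direction)]
    return (h(plus) - h(minus)) / (2 * eps)


a = [[1.0, 2.0], [3.0, -1.0]]          # arbitrary point A
direction = [[0.5, -1.0], [2.0, 0.25]]  # arbitrary direction H

linear_value = dh(a, direction)
finite_value = directional_difference(a, direction)
gap = abs(linear_value - finite_value)
print(linear_value, finite_value)
```

The point is that `dh(a, ...)` takes a matrix as input and returns a number, linearly in that input; the matrix $2A$ is only a convenient way of writing that functional down.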