Step back and frame the problem more generally.
Let $P$ be the probability matrix, where $P_{ij}$ is the probability of assigning an item in truth class $i$ to class $j$. All rows of $P$ sum to 1.
Create a matrix-valued cost function $C(P)$, where $C_{ij}(P)$ is the cost incurred due to having probability $P_{ij}$ of assigning an item in truth class $i$ to class $j$. Diagonal cost elements are 0 when the corresponding entry of $P$ is 1, and off-diagonal costs are 0 when the corresponding entry of $P$ is 0; i.e., $C_{ii}(P) = 0$ if $P_{ii} = 1$, and for $i \ne j$, $C_{ij}(P) = 0$ if $P_{ij} = 0$.
The objective is to minimize $\sum_i \sum_j C_{ij}(P)$. The sum of off-diagonal costs need not be equal across rows if the truth classes are of unequal importance, so this framework allows weighting both by truth class and by the class into which an item is misclassified.
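To make the setup concrete, here is a minimal sketch (the variable and function names are my own, purely illustrative): the objective just applies an elementwise cost function to $P$ and sums the result.

```python
import numpy as np

# Rows of P index the truth class: P[i, j] is the probability of assigning
# an item of truth class i to class j, so every row sums to 1.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
assert np.allclose(P.sum(axis=1), 1.0)

def objective(P, cost_fn):
    """Sum_i Sum_j C_ij(P), where cost_fn maps P to the cost matrix C(P)."""
    return cost_fn(P).sum()
```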
Your proposed loss function's lack of agnosticism is due to the nonlinearity of your cost function; in this case, the convexity of $-\log$ drives the solution toward equality.
If you use linear cost elements, you will achieve agnosticism. For instance, the diagonal elements $C_{ii}$ of $C(P)$ could be $c_{ii}(1 - P_{ii})$, and the off-diagonal elements $C_{ij}$ could be $c_{ij} P_{ij}$. If you adopt this linear structure, all you have to do is specify the $c_{ij}$'s for all $i$ and $j$.
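As a sketch of that linear structure (the helper name and the example weights are mine, not part of the question):

```python
import numpy as np

def linear_cost(P, c):
    """Elementwise linear costs: c_ii * (1 - P_ii) on the diagonal,
    c_ij * P_ij off the diagonal."""
    C = c * P                                # off-diagonal form everywhere
    d = np.arange(P.shape[0])
    C[d, d] = c[d, d] * (1.0 - P[d, d])      # replace the diagonal entries
    return C

# Same example P as above; c holds the user-chosen weights c_ij.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
c = np.ones_like(P)                          # unweighted case
print(linear_cost(P, c).sum())               # the objective: sum over all i, j
```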
You could still achieve agnosticism by using this linear cost structure for the off-diagonal elements while keeping nonlinear costs on the diagonal of $C(P)$. If you use the linear structure for the off-diagonal elements, with all off-diagonal $c_{ij}$ for a given $i$ equal to a common value $c_i$ (i.e., unweighted by the class into which the item is misclassified), and choose $C_{ii}(P) = -\log(P_{ii}) - c_i (1 - P_{ii})$, then this reduces to the standard cross-entropy loss (I'm not worrying about the factor $1/N$).
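If it helps, here is a quick numerical sanity check of that algebra (the arrays are arbitrary examples of my own): with a common per-row weight $c_i$ off the diagonal and the diagonal cost above, each row's total cost collapses to $-\log(P_{ii})$.

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
c = np.array([1.0, 2.0, 0.5])    # one common off-diagonal weight per truth class

diag = np.diag(P)
off_diag_total = c * (P.sum(axis=1) - diag)       # c_i * sum_{j != i} P_ij
diag_cost = -np.log(diag) - c * (1.0 - diag)      # nonlinear diagonal cost
print(np.allclose(diag_cost + off_diag_total, -np.log(diag)))   # True
```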
If you agree that a logistic regression is a special case of a neural network, then the answer is D. R. Cox, who invented logistic regression.[1]
A neural network with zero hidden layers and a single sigmoid output, trained to maximize the binomial likelihood (equivalently, to minimize the cross-entropy), is logistic regression.
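A minimal sketch of that equivalence (the simulated data and names are mine, just for illustration): gradient descent on the binary cross-entropy with a single sigmoid output and no hidden layers is exactly maximum-likelihood fitting of a logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # features
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3
y = (rng.random(500) < 1 / (1 + np.exp(-(X @ true_w + true_b)))).astype(float)

# "Neural network" with zero hidden layers and a sigmoid output,
# trained by gradient descent on the binary cross-entropy.
w, b = np.zeros(3), 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w + b)))               # sigmoid output
    grad_w = X.T @ (p - y) / len(y)                  # gradient of mean cross-entropy
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(w, b)   # approaches the maximum-likelihood logistic-regression fit
```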
Minimizing a binomial cross-entropy is equivalent to maximizing a particular likelihood: see the relationship between maximizing the likelihood and minimizing the cross-entropy.
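In case it helps, the one-line derivation for binary labels $y_i \in \{0,1\}$ with predicted probabilities $\hat{y}_i$ is

$$\log L(\theta) = \sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right],$$

which is exactly the negative of the (summed) cross-entropy, so maximizing the likelihood and minimizing the cross-entropy select the same parameters.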
[1] D. R. Cox, "The Regression Analysis of Binary Sequences," Journal of the Royal Statistical Society, Series B (Methodological), Vol. 20, No. 2 (1958), pp. 215-242.
Best Answer
It doesn't have a maximum value. When $y_i = 1$ and $\hat{y}_i = 0$, the loss is infinite; more precisely, as the predicted probability of the true class goes to $0$, the loss increases without bound. Since the range of $H(y,\hat{y})$ is $\mathbb{R}_{\geq 0}$ and $\infty \notin \mathbb{R}_{\geq 0}$, the function has no maximum: a maximum value would have to lie within the range. The predicted probabilities of the other classes don't matter, because the corresponding $y_j$ are $0$.
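A quick numerical illustration (assuming a one-hot target, so the loss reduces to $-\log$ of the predicted probability of the true class):

```python
import numpy as np

# As the predicted probability of the true class shrinks toward 0,
# the cross-entropy loss grows without bound.
for p_true in [0.5, 1e-2, 1e-5, 1e-10, 1e-100]:
    print(p_true, -np.log(p_true))
```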