Bernoulli$^*$ cross-entropy loss is a special case of categorical cross-entropy loss for $m=2$.
$$
\begin{align}
\mathcal{L}(\theta)
&= -\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^m y_{ij}\log(p_{ij}) \\
&= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i) \log(1-p_i)\right]
\end{align}
$$
where $i$ indexes samples/observations and $j$ indexes classes; $y$ is the sample label (the one-hot vector $y_{ij}$ in the first line, the binary scalar $y_i$ in the second) and $p_{ij}\in(0,1)$, with $\sum_{j} p_{ij} = 1$ for all $i$, is the predicted probability that sample $i$ belongs to class $j$.
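As a concrete check of the $m=2$ special case, here is a minimal sketch (the function names and data are mine, for illustration only) comparing the two forms numerically:

```python
import numpy as np

# Categorical cross-entropy for one-hot labels Y (n x m) and
# predictions P (n x m, rows sum to 1).
def categorical_ce(Y, P):
    n = Y.shape[0]
    return -np.sum(Y * np.log(P)) / n

# Bernoulli cross-entropy for binary labels y and predictions p.
def bernoulli_ce(y, p):
    n = y.shape[0]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) / n

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
Y = np.column_stack([1 - y, y])   # one-hot encoding, m = 2
P = np.column_stack([1 - p, p])   # class probabilities sum to 1
assert np.isclose(categorical_ce(Y, P), bernoulli_ce(y, p))
```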
I write "Bernoulli cross-entropy" because this loss arises from a Bernoulli probability model. There is not a "binary distribution." A "binary cross-entropy" doesn't tell us if the thing that is binary is the one-hot vector of $k \ge 2$ labels, or if the author is using binary encoding for each trial (success or failure). This isn't a general convention, but it makes clear that these formulae arise from particular probability models. Conventional jargon is not clear in that way.
Step back and frame the problem more generally.
Let $P$ be a probability matrix, where $P_{ij}$ is the probability of assigning an item in truth class $i$ to class $j$. All rows of $P$ sum to 1.
Create a matrix-valued cost function $C(P)$, where $C_{ij}$ is the cost incurred by having probability $P_{ij}$ of assigning an item in truth class $i$ to class $j$. Diagonal cost elements are 0 when the corresponding entry of $P$ is 1, and off-diagonal costs are 0 when the corresponding entry of $P$ is 0; i.e., $C_{ii}(P) = 0$ if $P_{ii} = 1$, and for $i \ne j$, $C_{ij}(P) = 0$ if $P_{ij} = 0$.
The objective is to minimize $\sum_i \sum_j C_{ij}(P)$. The sum of off-diagonal costs need not be equal across rows if the truth classes are of unequal importance, so this framework allows weighting both by truth class and by the class into which an item is misclassified, as in the sketch below.
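A minimal sketch of this framework (the cost function and matrix below are illustrative choices of mine, not prescribed by the framework):

```python
import numpy as np

# C is any function mapping the row-stochastic matrix P to an
# elementwise cost matrix C(P); the objective sums its entries.
def total_cost(P, C):
    return np.sum(C(P))

# One admissible choice: log-loss costs on the diagonal, zero elsewhere,
# so C_ii(P) = 0 when P_ii = 1, and off-diagonal costs vanish trivially.
def log_loss_costs(P):
    C = np.zeros_like(P)
    np.fill_diagonal(C, -np.log(np.diag(P)))
    return C

P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])   # rows sum to 1
print(total_cost(P, log_loss_costs))
```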
Your proposed loss function's lack of agnosticism is due to the nonlinearity of your cost function; in this case, the convexity of $-\log$ drives the solution toward equality.
If you use linear cost elements, you will achieve agnosticism. For instance, the diagonal elements $C_{ii}$ of $C(P)$ could be $c_{ii}(1-P_{ii})$, and the off-diagonal elements $C_{ij}$ could be $c_{ij}P_{ij}$. If you adopt this linear structure, all you have to do is specify the $c_{ij}$'s for all $i$ and $j$, as in the sketch below.
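A hedged sketch of this linear structure, with made-up weights $c_{ij}$ (here penalizing misclassification of class 2 more heavily):

```python
import numpy as np

# Linear costs: C_ij = c_ij * P_ij off the diagonal,
# C_ii = c_ii * (1 - P_ii) on the diagonal.
def linear_costs(P, c):
    C = c * P
    np.fill_diagonal(C, np.diag(c) * (1.0 - np.diag(P)))
    return C

c = np.array([[1.0, 2.0],
              [5.0, 1.0]])   # heavier penalty for misclassifying class 2
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # rows sum to 1
print(linear_costs(P, c).sum())   # total cost to minimize
```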
You could still achieve agnosticism by using this linear cost structure for the off-diagonal elements while having nonlinear costs on the diagonal of $C(P)$. Use the linear structure for the off-diagonal elements, with all off-diagonal $c_{ij}$ for a given $i$ equal to a common value $c_i$ (i.e., unweighted by misclassification class), so the off-diagonal costs in row $i$ sum to $c_i \sum_{j \ne i} P_{ij} = c_i(1-P_{ii})$. Then choosing $C_{ii}(P) = -\log(P_{ii}) - (1-P_{ii})\,c_i$ cancels that sum, and the objective reduces to the standard cross-entropy loss (I'm not worrying about the factor $1/N$).
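A quick numeric check of this reduction (the random $P$ and weights $c_i$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                  # number of classes
P = rng.random((M, M))
P /= P.sum(axis=1, keepdims=True)      # make rows sum to 1
c = rng.random(M)                      # common off-diagonal weight per row

row_totals = []
for i in range(M):
    off_diag = sum(c[i] * P[i, j] for j in range(M) if j != i)
    diag = -np.log(P[i, i]) - (1 - P[i, i]) * c[i]
    row_totals.append(diag + off_diag)

# Each row total collapses to the standard cross-entropy term -log(P_ii).
assert np.allclose(row_totals, -np.log(np.diag(P)))
```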
Best Answer
Suppose there is a random variable $Y \in \{0,1\}$ (binary classification); then the Bernoulli probability model gives us:
$$ L(p) = p^y (1-p)^{1-y} $$
$$ \log L(p) = y\log p + (1-y) \log (1-p) $$
It's often easier to work with derivatives when the metric is on the log scale, and because $\log$ is monotone, the maximizer of the log-likelihood is the same as the maximizer of the likelihood. The inherent meaning of a cost or loss function is that the further it is from 0, the worse the model performs. The negative sign just preserves that meaning and makes the loss easier to interpret; maximizing the log-likelihood above leads to the same result as minimizing the negative log-likelihood.
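To illustrate, here is a small sketch (made-up data, grid search for simplicity) confirming that the $p$ maximizing the log-likelihood is the same $p$ that minimizes the negative log-likelihood:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])       # made-up binary observations
p_grid = np.linspace(0.01, 0.99, 99)   # candidate values of p

log_lik = [np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)) for p in p_grid]
nll = [-ll for ll in log_lik]          # negative log-likelihood (the loss)

assert p_grid[np.argmax(log_lik)] == p_grid[np.argmin(nll)]
print(p_grid[np.argmax(log_lik)])      # close to the sample mean, 4/6
```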