The categorical distribution is the minimum assumptive distribution over the support of "a finite set of mutually exclusive outcomes" given the sufficient statistic of "which outcome happened". In other words, using any other distribution would be an additional assumption. Without any prior knowledge, you must assume a categorical distribution for this support and sufficient statistic. It is an exponential family. (All minimum assumptive distributions for a given support and sufficient statistic are exponential families.)
The correct way to combine two beliefs based on independent information is the pointwise product of densities, making sure not to double-count any prior information that appears in both beliefs. For an exponential family, this combination is addition of natural parameters.
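As a minimal sketch of that combination rule (my own example, not from the answer above; the function names are mine), each belief about $K$ outcomes can be represented by its natural parameters (logits), and combining independent beliefs is just adding the logit vectors:

```python
import numpy as np

# Combining two beliefs based on independent information is a pointwise
# product of densities; in natural-parameter (logit) space that is simply
# addition. Here we assume a flat prior, so adding the logits does not
# double-count any shared prior information.

def combine_beliefs(eta_1, eta_2):
    """Combine two independent beliefs expressed as logit vectors."""
    return eta_1 + eta_2

def to_probabilities(eta):
    """Convert natural parameters (logits) back to probabilities."""
    e = np.exp(eta - eta.max())   # subtract the max for numerical stability
    return e / e.sum()

eta_a = np.log(np.array([0.5, 0.3, 0.2]))   # belief from source A
eta_b = np.log(np.array([0.2, 0.3, 0.5]))   # belief from source B

# The result is proportional to the elementwise product [0.5*0.2, 0.3*0.3, 0.2*0.5]
print(to_probabilities(combine_beliefs(eta_a, eta_b)))
```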
The expectation parameters are the expected values of $x_k$, where $x_k$ is the number of times you observed outcome $k$. This is the right parametrization for converting a set of observations into a maximum likelihood distribution: you simply average in this space. This is what you want when you are modeling observations.
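To illustrate that averaging step (my own sketch, not part of the original answer): each observation's sufficient statistic is a one-hot vector saying which outcome happened, and averaging those vectors gives the maximum likelihood expectation parameters, i.e. the empirical frequencies.

```python
import numpy as np

# Average the one-hot sufficient statistics to get the maximum likelihood
# expectation parameters of the categorical distribution.

K = 3
observations = [2, 0, 2, 1, 2]           # observed outcome indices
one_hot = np.eye(K)[observations]        # one row of sufficient statistics per observation
expectation_params = one_hot.mean(axis=0)
print(expectation_params)                # [0.2 0.2 0.6] = empirical frequencies
```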
The multinomial logistic function is the conversion from natural parameters to expectation parameters of the categorical distribution. You can derive this conversion as the gradient of the log-normalizer with respect to natural parameters.
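To spell out that gradient step (a standard calculation, stated here for completeness): with natural parameters (logits) $\eta_1, \dots, \eta_K$, the log-normalizer of the categorical distribution is $A(\eta) = \log \sum_{j=1}^{K} e^{\eta_j}$, and
$$
\frac{\partial A}{\partial \eta_k} = \frac{e^{\eta_k}}{\sum_{j=1}^{K} e^{\eta_j}} = \operatorname{softmax}(\eta)_k = \mathbb{E}[x_k],
$$
which is exactly the multinomial logistic (softmax) function mapping natural parameters to expectation parameters.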
In summary, the multinomial logistic function falls out of three assumptions: a support, a sufficient statistic, and a model whose belief is a combination of independent pieces of information.
As whuber points out, $\delta_{ij}$ is the Kronecker delta (https://en.wikipedia.org/wiki/Kronecker_delta):
$$
\begin{align}
\delta_{ij} = \begin{cases}
0 & \text{when } i \ne j \\
1 & \text{when } i = j
\end{cases}
\end{align}
$$
... and remember that a softmax takes a vector of inputs and gives a vector of outputs, where the input and output vectors have the same length.
Each of the values in the output vector will change if any of the input vector values change, so each output value is a function of all of the input values:
$$
y_{k'} = f_{k'}(a_1, a_2, a_3,\dots, a_K)
$$
where $k'$ is the index into the output vector, the vectors are of length $K$, and $f_{k'}$ is some function. So the input and output vectors both have length $K$, and both $k$ and $k'$ take values in $\{1,2,3,\dots,K\}$.
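As a concrete sketch of those shapes (my own example, not part of the original answer), here is a softmax that takes a length-$K$ input vector and returns a length-$K$ output vector, where every output depends on every input:

```python
import numpy as np

def softmax(a):
    """Map a length-K input vector to a length-K output vector."""
    e = np.exp(a - a.max())   # shift by the max for numerical stability
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])   # input vector, K = 3
y = softmax(a)
print(y, y.shape)               # output vector, also of length 3
```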
When we differentiate $y_{k'}$, we differentiate partially with respect to each of the input vector values. So we will have:
- $\frac{\partial y_{k'}}{\partial a_1}$
- $\frac{\partial y_{k'}}{\partial a_2}$
- etc ...
Rather than calculating individually for each of $a_1$, $a_2$, etc., we'll just use $k$ to represent the index 1, 2, 3, etc., i.e. we will calculate:
$$
\frac{\partial y_{k'}}{\partial a_k}
$$
...where:
- $k \in \{1,2,3,\dots,K\}$ and
- $k' \in \{1,2,3,\dots,K\}$
When we do this differentiation (see e.g. https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/), the derivative will be:
$$
\frac{\partial y_{k'}}{\partial a_k} = \begin{cases}
y_k(1 - y_{k'}) &\text{when }k = k'\\
- y_k y_{k'} &\text{when }k \ne k'
\end{cases}
$$
We can then write this using the Kronecker delta, which is simply a notational convenience that avoids writing out the 'cases' statement each time:
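$$
\frac{\partial y_{k'}}{\partial a_k} = y_{k'}\left(\delta_{k k'} - y_k\right),
$$
which reduces to the two cases above when $k = k'$ and when $k \ne k'$.

As a quick numerical sanity check (my own sketch, not from the linked post), we can compare this analytic Jacobian against a finite-difference approximation:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([0.5, -1.0, 2.0])
y = softmax(a)
K = len(a)

# Analytic Jacobian: J[k', k] = y_k' * (delta_{k k'} - y_k)
analytic = np.diag(y) - np.outer(y, y)

# Central finite-difference Jacobian, perturbing one input at a time
eps = 1e-6
numeric = np.zeros((K, K))
for k in range(K):
    da = np.zeros(K)
    da[k] = eps
    numeric[:, k] = (softmax(a + da) - softmax(a - da)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny, around 1e-10 or less
```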
Best Answer
To my knowledge there is no deeper reason, apart from the fact that a lot of the people who took ANNs beyond the Perceptron stage were physicists.
Apart from the benefits already mentioned, this particular choice has further advantages. As mentioned, it has a single parameter that determines the output behaviour, which can in turn be optimized or tuned.
In short, it is a very handy and well-known function that achieves a kind of 'regularization', in the sense that even the largest input values are restricted to a bounded output range.
Of course there are many other possible functions that fulfill the same requirements, but they are less well known in the world of physics, and most of the time they are harder to use.
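A hedged illustration of those two points (assuming the single parameter referred to is the steepness, or inverse temperature, of the logistic function; the names below are mine):

```python
import numpy as np

# One tunable parameter controls the output behaviour, and even large
# inputs are squashed into the bounded range (0, 1).

def logistic(x, steepness=1.0):
    return 1.0 / (1.0 + np.exp(-steepness * x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for s in (0.5, 1.0, 5.0):
    print(s, logistic(x, s))
```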