Machine Learning – Why Use Softmax Function to Calculate Probabilities?

machine-learning, neural-networks, softmax

Applying the softmax function to a vector produces "probabilities": values between $0$ and $1$ that sum to $1$.

But we could also just divide each value by the sum of the vector, and that would likewise produce probabilities, with values between $0$ and $1$.
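For concreteness, here is a minimal NumPy sketch of the two normalizations being compared (the function names are my own):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def naive_normalize(x):
    # The alternative from the question: divide each value by the sum.
    x = np.asarray(x, dtype=float)
    return x / x.sum()

v = [1.0, 2.0, 3.0]
print(softmax(v))          # nonnegative entries that sum to 1
print(naive_normalize(v))  # also sums to 1 here, but only because every entry is positive
```

On an all-positive vector the two look interchangeable; the answer below shows where the naive version breaks down.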

I read the answer here, but it says that the reason is that softmax is differentiable, even though both functions are differentiable.

Best Answer

The function you propose has a singularity whenever the sum of the elements is zero.

Suppose your vector is $[-1, \frac{1}{3}, \frac{2}{3}]$. This vector has a sum of 0, so division is not defined. The function is not differentiable here.

Additionally, if one or more of the elements of the vector is negative but the sum is nonzero, your result is not a probability.

Suppose your vector is $[-1, 0, 2]$. This has a sum of $1$, so applying your function results in $[-1, 0, 2]$, which is not a probability vector: it has negative elements and an element exceeding $1$.
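Both failure modes are easy to reproduce in NumPy (a sketch; the vector with zero sum uses exact binary fractions rather than $[-1, \frac{1}{3}, \frac{2}{3}]$ so that the floating-point sum is exactly $0$):

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Failure mode 1: the sum is zero, so dividing by it is undefined
# (NumPy returns nan/inf with a warning). Softmax is still fine.
v0 = np.array([-1.0, 0.25, 0.75])
print(v0.sum())     # 0.0
print(softmax(v0))  # a valid probability vector

# Failure mode 2: negative entries with a nonzero sum.
v1 = np.array([-1.0, 0.0, 2.0])
print(v1 / v1.sum())  # [-1.  0.  2.] -- negative entry and an entry above 1
print(softmax(v1))    # all entries in (0, 1), summing to 1
```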

Taking a wider view, we can motivate the specific form of the softmax function from the perspective of extending binary logistic regression to the case of three or more categorical outcomes.

Doing things like taking absolute values or squares, as suggested in comments, means that $-x$ and $x$ have the same predicted probability; this means the model is not identified. By contrast, $\exp(x)$ is monotonic and positive for all real $x$, so the softmax result is (1) a probability vector and (2) the multinomial logistic model is identified.
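A short NumPy check of both points (function names are my own): squaring before normalizing assigns $x$ and $-x$ identical probabilities, while softmax distinguishes them; and with two classes, softmax reduces to the logistic sigmoid of the difference of the inputs, which is the sense in which it extends binary logistic regression:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def square_normalize(x):
    # The alternative suggested in the comments: square, then normalize.
    s = np.asarray(x, dtype=float) ** 2
    return s / s.sum()

x = np.array([1.0, -2.0])
# Squaring cannot tell x from -x, so the model is not identified:
print(square_normalize(x), square_normalize(-x))  # identical outputs
# exp is monotone, so softmax assigns them different probabilities:
print(softmax(x), softmax(-x))

# With two classes, softmax equals the logistic sigmoid of the difference:
sigmoid = 1 / (1 + np.exp(-(x[0] - x[1])))
print(np.allclose(softmax(x)[0], sigmoid))  # True
```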
