Machine Learning Softmax – What is the Role of Temperature in Softmax?

machine learning, neural networks, softmax

I've recently been working on CNNs, and I want to know what the function of the temperature $T$ is in the softmax formula. Why should we use high temperatures to obtain a softer probability distribution?

The formula can be seen below:

$$\large P_i=\frac{e^{\frac{y_i}{T}}}{\sum_{k=1}^{n} e^{\frac{y_k}{T}}}$$

Best Answer

The temperature is a way to control the entropy of a distribution while preserving the relative ranking of the events.


If two events $i$ and $j$ have probabilities $p_i$ and $p_j$ under your softmax, then adjusting the temperature to obtain new probabilities $p'_i$ and $p'_j$ preserves their ordering, as long as the temperature is finite and positive:

$$p_i > p_j \Longleftrightarrow p'_i > p'_j$$

This holds because $x \mapsto e^{x/T}$ is strictly increasing for any $T > 0$, and the normalizing denominator is the same for every event.


Heating a distribution (increasing $T$) increases the entropy, bringing it closer to a uniform distribution. (Try it for yourself: construct a simple distribution like $\mathbf{y}=(3, 4, 5)$, then divide all $y_i$ values by $T=1000000$ and see how the distribution changes; a short sketch of this experiment follows below.)

Cooling it (decreasing $T$) decreases the entropy, accentuating the most likely events.
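
Here is a minimal NumPy sketch of that experiment (the helper names `softmax` and `entropy` are my own, not from any particular library). It shows the entropy shrinking toward $0$ as the distribution is cooled and climbing toward $\log 3$ as it is heated, while the ranking of the three events never changes:

```python
import numpy as np

def softmax(y, T=1.0):
    """Softmax with temperature T: P_i = exp(y_i / T) / sum_k exp(y_k / T)."""
    z = np.asarray(y, dtype=float) / T
    z -= z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return -np.sum(p * np.log(p))

y = np.array([3.0, 4.0, 5.0])
for T in (0.1, 1.0, 10.0, 1_000_000.0):
    p = softmax(y, T)
    # The ordering p_1 < p_2 < p_3 is the same at every finite positive T;
    # only the entropy changes (toward 0 when cooled, toward log 3 when heated).
    print(f"T={T:>9}: p={np.round(p, 4)}, entropy={entropy(p):.4f}")
```

At $T=0.1$ almost all of the mass sits on the last class, while at $T=1000000$ the output is essentially uniform, with entropy close to $\log 3 \approx 1.0986$.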

I’ll put that another way. It’s common to talk about the inverse temperature $\beta=1/T$. If $\beta = 0$, you get the uniform distribution. As $\beta \to \infty$, you approach a degenerate distribution with all mass concentrated on the highest-probability class. This is why softmax is considered a soft relaxation of argmax.
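
A quick way to see both limits numerically (again just a sketch; `softmax_beta` is an ad-hoc helper that multiplies the logits by $\beta$ instead of dividing by $T$):

```python
import numpy as np

def softmax_beta(y, beta):
    """Softmax parameterized by the inverse temperature beta = 1/T."""
    z = beta * np.asarray(y, dtype=float)
    z -= z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

y = np.array([3.0, 4.0, 5.0])
print(np.round(softmax_beta(y, beta=0.0), 4))    # [0.3333 0.3333 0.3333] -> uniform
print(np.round(softmax_beta(y, beta=100.0), 4))  # [0. 0. 1.] -> essentially argmax, but still "soft"
```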