[Math] Meaning of inverse temperature


I am not very familiar with statistics, so while reading a book that covers statistical learning I ran into a question. There, the a posteriori probability density function is defined as follows.

Let $D_n=\{X_1,X_2,\cdots, X_n \}$ be a set of random variables and let $\varphi(w)$ be an a priori probability density function. Then, for a given statistical model $p(x|w)$, the "a posteriori probability density function $p(w|D_n)$ with inverse temperature $\beta>0$" is defined by
$$p(w|D_n)=\frac{1}{Z_n} \varphi(w) \prod_{i=1}^n p(X_i|w)^\beta,$$
where $Z_n$ is the normalizing factor.

My question is: what is the meaning of "inverse temperature"? I would like to understand its role in this definition.
Any reference would be helpful, thanks!

Best Answer

This name refers to an analogy with statistical mechanics. In statistical mechanics there is a general statement of the following rough form. Suppose a system can be in states $i \in I$ with energies $E_i$, and that it is in "thermal equilibrium" (it doesn't really matter for the moment what this means). Then the probability that the system is in state $i$ is proportional to $\exp (- \beta E_i)$, where

$$\beta = \frac{1}{kT}$$

(thermodynamic beta) is inversely proportional to the temperature of the system; this probability distribution is known as the Boltzmann distribution or Gibbs distribution. One elegant way to characterize it is as the maximum entropy distribution over states with a particular average energy.

This lets us say some rough qualitative things about what the system looks like at high temperatures / low $\beta$ as opposed to low temperatures / high $\beta$ (a short numerical sketch after the list illustrates both regimes):

  • At high temperatures / low $\beta$ the numbers $\exp ( - \beta E_i)$ are all roughly equal, so the system is roughly equally likely to be in any of its possible states (think of boiling water).
  • At low temperatures / high $\beta$ the function $\exp ( -\beta E)$ is a strongly decreasing function of $E$; it is largest when $E$ is smallest, and for other values of $E$ it will be much smaller. So the system spends most of its time in its lowest energy states, or ground states (think of ice).
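
To make the two bullet points above concrete, here is a minimal numerical sketch of the Boltzmann distribution for a toy four-state system (the energies and the `boltzmann` helper are my own choices for illustration, not anything from the book):

```python
# Minimal sketch: Boltzmann weights exp(-beta * E_i) for a toy system,
# at a small beta (high temperature) and a large beta (low temperature).
import numpy as np

energies = np.array([0.0, 1.0, 2.0, 5.0])  # hypothetical state energies E_i

def boltzmann(beta, E):
    """Return the Boltzmann/Gibbs distribution p_i = exp(-beta * E_i) / Z(beta)."""
    weights = np.exp(-beta * E)
    return weights / weights.sum()

print(boltzmann(0.1, energies))   # high T / low beta: probabilities nearly equal
print(boltzmann(10.0, energies))  # low T / high beta: mass concentrates on the ground state E = 0
```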

The proportionality constant / normalizing factor in the "proportional to $\exp (-\beta E_i)$" above is the partition function

$$Z(\beta) = \sum_{i \in I} \exp(-\beta E_i)$$

which encodes a lot of important information about the system; most basically,

$$ - \frac{Z'(\beta)}{Z(\beta)} = \langle E \rangle$$

is the average energy of the system at inverse temperature $\beta$, and we can also compute the variance of the energy, etc.
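
As a quick sanity check of the identity $-Z'(\beta)/Z(\beta) = \langle E \rangle$, here is a small sketch (my own, using the same toy energies as above) that compares a finite-difference derivative of $Z$ with the average energy computed directly:

```python
# Sketch: verify -Z'(beta)/Z(beta) = <E> numerically for the toy energies above.
import numpy as np

energies = np.array([0.0, 1.0, 2.0, 5.0])

def Z(beta):
    """Partition function Z(beta) = sum_i exp(-beta * E_i)."""
    return np.exp(-beta * energies).sum()

beta, h = 1.3, 1e-6
p = np.exp(-beta * energies) / Z(beta)            # Boltzmann distribution at this beta

lhs = -(Z(beta + h) - Z(beta - h)) / (2 * h) / Z(beta)  # -Z'(beta)/Z(beta), central difference
rhs = (p * energies).sum()                              # <E> computed directly
print(lhs, rhs)  # the two numbers agree up to the finite-difference error
```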

I'm not the best qualified person to discuss the precise relationship to statistical learning, but in any case, here's a start: given any positive probability distribution $p_i > 0$ on any set $I$, you can fit it into a Boltzmann distribution by just defining

$$E_i = - \log p_i$$

and so considering a family of new distributions where the probability of $i$ is proportional to $\exp ( - \beta E_i) = p_i^{\beta}$. In other words, you imagine that your original probability distribution was the distribution of states of a system at thermodynamic equilibrium at $\beta = 1$, and you ask what happens to the system as the temperature goes up or down from this. As above, as $\beta \to 0$ this has the effect of evening out the probabilities and as $\beta \to \infty$ this has the effect of making the most likely possibilities even more likely.

You can also now do some fun things like compute the partition function and do calculations with it; here the average energy at $\beta = 1$ has the interpretation of the entropy of the original distribution.
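
Here is a short sketch of that construction (my own illustration, with an arbitrary made-up distribution $p$): set $E_i = -\log p_i$, temper, and check both that $\beta = 1$ recovers $p$ and that the average energy at $\beta = 1$ equals the Shannon entropy of $p$:

```python
# Sketch: start from an arbitrary positive distribution p, set E_i = -log p_i,
# and form the tempered family p_i^beta / Z(beta).
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])    # arbitrary positive distribution
E = -np.log(p)                          # energies chosen so that exp(-E_i) = p_i

def tempered(beta):
    """Tempered distribution proportional to exp(-beta * E_i) = p_i^beta."""
    w = p ** beta
    return w / w.sum()

print(tempered(1.0))                    # recovers p itself (Z(1) = 1 here)
print(tempered(0.1))                    # beta -> 0: nearly uniform
print(tempered(10.0))                   # beta -> infinity: concentrates on the most likely state

avg_energy_at_1 = (p * E).sum()
entropy = -(p * np.log(p)).sum()
print(avg_energy_at_1, entropy)         # identical: <E> at beta = 1 is the entropy of p
```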

If you tell me what the book is I might be able to say more. Based on what you've quoted (which doesn't quite fit into the idea I just described unless the prior is uniform), here's a guess: when $\beta = 1$ this is just the usual posterior distribution implied by Bayes' theorem. For different values of $\beta$ you are effectively either deliberately under-updating (when $\beta < 1$) or over-updating (when $\beta > 1$) on the evidence. This is easiest to see when $\beta$ is an integer: when $\beta = 0$ you are not updating on the evidence at all and sticking to your prior, and when $\beta = n \ge 2$ you are effectively pretending to have seen each piece of evidence $n$ (independent) times.
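
To illustrate that guess, here is a toy sketch (my own, not from any book) of the tempered posterior for a Bernoulli model on a parameter grid with a uniform prior; it checks that $\beta = 0$ returns the prior and that $\beta = 2$ matches the ordinary $\beta = 1$ posterior computed on the data set repeated twice:

```python
# Sketch: tempered posterior for a Bernoulli model p(x|w) = w^x (1-w)^(1-x),
# computed on a grid over w with a uniform prior phi(w).
import numpy as np

w = np.linspace(0.001, 0.999, 999)      # grid over the parameter
prior = np.ones_like(w) / w.size        # uniform prior phi(w) on the grid
data = np.array([1, 1, 0, 1, 0, 1, 1])  # hypothetical coin flips X_1, ..., X_n

def tempered_posterior(data, beta):
    """Posterior proportional to phi(w) * prod_i p(X_i|w)^beta, normalized over the grid."""
    lik = np.prod([w**x * (1 - w)**(1 - x) for x in data], axis=0)
    post = prior * lik**beta
    return post / post.sum()            # Z_n is the normalizing factor

# beta = 0: no updating at all, the posterior is just the prior.
print(np.allclose(tempered_posterior(data, 0.0), prior))
# beta = 2: same as the ordinary beta = 1 posterior after "seeing the data twice".
print(np.allclose(tempered_posterior(data, 2.0),
                  tempered_posterior(np.tile(data, 2), 1.0)))
```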
