Information Theory – Appropriateness of Using ‘Bits’ for Log-Base-2 Likelihood Ratio

information-theory, likelihood-ratio, terminology

I'm quite enamoured with likelihood ratios as a means of quantifying relative evidence in scientific endeavours. However, in practice I find that the raw likelihood ratio can get unprintably large, so I've taken to log-transforming them, which has the nice side benefit of representing evidence for/against the denominator in a symmetric fashion (i.e. the absolute value of the log likelihood ratio represents the strength of evidence and the sign indicates which model, the numerator or the denominator, is the supported one).

Now, what choice of logarithm base? Most likelihood metrics use log base $e$, but this strikes me as a not very intuition-friendly base. For a while I used log base 10, which apparently was dubbed the "ban" scale by Alan Turing and has the nice property that one can easily discern relative orders of magnitude of evidence.

It recently occurred to me that it might also be useful to employ log base 2, in which case I thought it might be appropriate to use the term "bit" to refer to the resulting values. For example, a raw likelihood ratio of 16 would transform to 4 bits of evidence for the numerator relative to the denominator. However, I wonder if this use of the term "bit" violates its conventional information-theoretic sense. Any thoughts?
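To make the conversion concrete, here is a minimal sketch in Python/NumPy (the likelihood-ratio value of 16 is just my example above; the unit names are the conventional ones for each base):

```python
# Minimal sketch: expressing a raw likelihood ratio on the three common log scales.
import numpy as np

lr = 16.0                # raw likelihood ratio (numerator / denominator), example value

bits = np.log2(lr)       # log base 2  -> "bits"  (here 4.0)
bans = np.log10(lr)      # log base 10 -> "bans"  (Turing's unit)
nats = np.log(lr)        # log base e  -> "nats"

print(f"{bits:.3f} bits, {bans:.3f} bans, {nats:.3f} nats")
# Sign convention: positive favours the numerator model, negative the denominator model.
```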

Best Answer

I think it's perfectly well justified. (In fact, I've used this convention in papers I've published; or you can call them "nats" if you prefer to stick with base-$e$ logarithms.)

The justification runs as follows: the log-likelihood of the fitted model can be viewed as a Monte Carlo estimate of (the negative of) the KL divergence between the "true" (unknown) data distribution and the distribution implied by the fitted model, up to an additive constant. Let $P(x)$ denote the "true" distribution of the data, and let $P_\theta(x)$ denote the distribution (i.e., the likelihood $P(x \mid \theta)$) provided by a model.

Maximum likelihood fitting involves maximizing

$L(\theta) = \frac{1}{N}\sum_i \log P_\theta(x_i) \approx \int P(x) \log P_\theta(x) dx$

The left-hand side (the log-likelihood, divided by the number of datapoints $N$) is a Monte Carlo estimate of the right-hand side, since the datapoints $x_i$ are drawn from $P(x)$. So we can rewrite

$L(\theta) \approx \int P(x) \log P_\theta(x) dx = \int P(x) \log \frac{P_\theta(x)}{P(x)} dx + \int P(x) \log P(x)dx$

$ = -D_{KL}(P,P_\theta) - H(x)$

So the log-likelihood normalized by the number of points is an estimate of the (negative) KL divergence between $P$ and $P_\theta$ minus the (true) entropy of $x$. The KL divergence has units of "bits" (if we use log base 2), and can be understood as the number of "extra bits" you would need to encode data from $P(x)$ using a codebook based on $P_\theta(x)$. (If $P = P_\theta$, you don't need any extra bits, so the KL divergence is zero.)
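As a numerical sanity check (my own sketch, not part of the original argument), here is a small Python example with Gaussians, where both the KL divergence and the differential entropy have closed forms; the particular means and standard deviations are arbitrary choices:

```python
# Check numerically that mean log-likelihood ~ -KL(P || P_theta) - H(P), in bits.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 200_000

mu1, s1 = 0.0, 1.0      # "true" distribution P = N(0, 1)
mu2, s2 = 0.5, 1.2      # mis-specified model P_theta = N(0.5, 1.2^2)

x = rng.normal(mu1, s1, N)                        # samples from the "true" P

# Left-hand side: average log-likelihood under the model, converted to bits
L_bits = np.mean(norm.logpdf(x, mu2, s2)) / np.log(2)

# Right-hand side: -KL(P || P_theta) - H(P), using Gaussian closed forms (in nats)
kl = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
H  = 0.5 * np.log(2 * np.pi * np.e * s1**2)       # differential entropy of P
rhs_bits = (-kl - H) / np.log(2)

print(f"Monte Carlo estimate: {L_bits:.4f} bits,  -KL - H: {rhs_bits:.4f} bits")
```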

Now, when you take the log-likelihood ratio of two different models (again normalized by $N$), you end up with:

$\frac{1}{N}\sum_i \log \frac{P_{\theta_1}(x_i)}{P_{\theta_2}(x_i)} \approx D_{KL}(P,P_{\theta_2}) - D_{KL}(P,P_{\theta_1})$

The entropy $H(x)$ terms cancel. So the log-likelihood ratio (normalized by $N$) is an estimate of the difference between the KL divergence from the true distribution to the distribution provided by model 2 and the KL divergence from the true distribution to the distribution provided by model 1. It's therefore an estimate of the number of "extra bits" you need to code your data with model 2 compared to coding it with model 1. So I think the "bits" units are perfectly well justified.
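Here is another small sketch of mine illustrating that identity, again with Gaussians so the KL terms have closed forms (the two candidate models and their parameters are arbitrary):

```python
# Per-datapoint log-likelihood ratio in bits vs. the difference of KL divergences.
import numpy as np

def gauss_logpdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)

def gauss_kl(mu_a, s_a, mu_b, s_b):
    # KL( N(mu_a, s_a^2) || N(mu_b, s_b^2) ), in nats
    return np.log(s_b / s_a) + (s_a**2 + (mu_a - mu_b)**2) / (2 * s_b**2) - 0.5

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 200_000)       # data drawn from the "true" P = N(0, 1)

m1 = (0.1, 1.05)                        # model 1 (closer to the truth)
m2 = (0.8, 1.50)                        # model 2 (further from the truth)

llr_bits = np.mean(gauss_logpdf(x, *m1) - gauss_logpdf(x, *m2)) / np.log(2)
kl_diff_bits = (gauss_kl(0.0, 1.0, *m2) - gauss_kl(0.0, 1.0, *m1)) / np.log(2)

print(f"LLR/N: {llr_bits:.4f} bits;  KL(P,P2) - KL(P,P1): {kl_diff_bits:.4f} bits")
```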

One important caveat: when using this statistic for model comparison, you should really use the LLR computed on held-out (cross-validated) data. The log-likelihood of training data is generally artificially high (favoring the model with more parameters) due to overfitting. That is, the model assigns this data higher probability than it would if it were fit to an infinite set of training data and then evaluated at the points $x_1, \dots, x_N$ in your dataset. So the procedure many people follow is to:

  1. train models 1 and 2 using training data;

  2. evaluate the log-likelihood ratio on a test dataset and report the resulting number, in units of bits, as a measure of the improved "code" provided by model 1 compared to model 2 (a minimal sketch of this recipe follows below).

The LLR evaluated on training data would generally give an unfair advantage to the model with more parameters / degrees of freedom.
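A compact sketch of that train/test recipe (my own illustration, not code from the answer; the simulated dataset and the two nested Gaussian models are just placeholders):

```python
# Fit two models on training data, then report the held-out LLR in bits.
import numpy as np

def gauss_logpdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)

rng = np.random.default_rng(2)
data = rng.normal(0.3, 1.0, 20_000)          # stand-in for the real dataset
train, test = data[:10_000], data[10_000:]

# Step 1: fit both models on the training data only (maximum likelihood)
mu1, s1 = train.mean(), train.std()          # model 1: Gaussian, free mean and variance
s2 = np.sqrt(np.mean(train**2))              # model 2: zero-mean Gaussian, free variance

# Step 2: log-likelihood ratio on the held-out test set, per datapoint, in bits
llr_bits = np.mean(gauss_logpdf(test, mu1, s1) - gauss_logpdf(test, 0.0, s2)) / np.log(2)
print(f"held-out LLR: {llr_bits:.4f} bits per datapoint (positive favours model 1)")
```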