Solved – Why does the L2 norm heuristic work in measuring uniformity of probability distributions

Tags: distributions, heuristic, uniform distribution

To start off, please go through this question regarding measuring non-uniformity in probability distributions.

Among several good answers, user495285 suggested the heuristic of simply taking the L2 norm of a vector whose values sum to 1. In my experiments it works very well for most practical applications and is cheap to compute.

I am now writing a paper about a machine learning algorithm that makes use of this heuristic, and I need to provide at least some background explaining why the heuristic works reasonably well.

Given that taking L2 norms for this purpose is probably just a heuristic, I understand there might not be a solid theoretical basis. Nevertheless, I need some intuition about what could be going on, so that I can explain it in the paper and be clear about it myself. Ideally, if there is a proper explanation available that I can cite directly, kindly share it here.

I searched the web and found some documents that discuss using L2 norms in the context of measuring uniformity, but I'm not sure whether they give an intuitive explanation of why the heuristic works, or whether they are citable. Here are the documents:

  1. Testing Uniformity of Distributions
  2. Ratio and Difference of l1 and l2 Norms and Sparse
    Representation with Coherent Dictionaries
  3. Sublinear time algorithms

Finally, if you have any other ideas on how to measure non-uniformity in distributions, or can explain why one measure is better than another, please let me know.

Best Answer

I believe the intended application is that

  • frequencies $f_i$ of $n$ items $i=1,2,\ldots, n$ have been observed; and

  • you are wondering whether these frequencies are consistent with an underlying uniform distribution in which all observations are (a) independent and (b) equally probable.

The "vector" in question is the normalized tuple of relative frequencies,

$$p = (p_1, p_2, \ldots, p_n) = \left(\frac{f_1}{f}, \frac{f_2}{f}, \ldots, \frac{f_n}{f}\right)$$

with $f = f_1 + f_2 + \cdots + f_n$ being the total number of observations. The $L_2$ norm of $p$ is, by definition, the square root of

$$||p||_2^2 = p_1^2 + p_2^2 + \cdots + p_n^2.$$
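As a concrete illustration of this quantity, here is a minimal Python sketch computing $||p||_2^2$ from observed counts; the counts are made-up numbers, not data from the question:

```python
import numpy as np

# Hypothetical observed frequencies f_i for n = 4 categories (made-up numbers)
counts = np.array([12, 9, 15, 4])

p = counts / counts.sum()    # relative frequencies p_i, summing to 1
l2_squared = np.sum(p ** 2)  # squared L2 norm, ||p||_2^2

# A perfectly uniform p would give 1/n = 0.25; these counts give 0.29125
print(l2_squared)
```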

Using this as a measure of uniformity of the $p_i$ has an intuitive mathematical justification--it does get larger as the $p_i$ vary more--but it has no immediate statistical justification. Let's see whether we can put it on a solid footing.

To this end, notice that the average value of the $p_i$ is $\bar p = 1/n$ (because they sum to unity and there are $n$ of them). Uniformity doesn't really refer to the actual values of the $p_i$: it refers to how they vary around their expected value. Let us therefore compute that variation and try to relate it to the $L_2$ norm. A well-known algebraic result (easily proven) is

$$||p||_2^2 = \sum_i \left(p_i - \frac{1}{n}\right)^2 + \frac{1}{n}.$$
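For what it's worth, this identity is easy to confirm numerically; a quick standalone check in Python, using an arbitrary relative-frequency vector:

```python
import numpy as np

p = np.array([12, 9, 15, 4]) / 40.0     # any vector of relative frequencies
n = len(p)
lhs = np.sum(p ** 2)                    # ||p||_2^2
rhs = np.sum((p - 1 / n) ** 2) + 1 / n  # squared deviations from 1/n, plus 1/n
assert np.isclose(lhs, rhs)             # the identity holds whenever sum(p) == 1
```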

Now the number of observations $f$ must play a critical role, for without that information we have no good idea how variable the observed frequencies ought to be. It is natural to introduce a factor of $f^2$ in order to clear the denominators in the $p_i = f_i/f$:

$$f^2||p||_2^2 = \sum_i \left(f p_i - \frac{f}{n}\right)^2 + \frac{f^2}{n} = \sum_i \left(f_i - \frac{f}{n}\right)^2 + \frac{f^2}{n}.\tag{1}$$

This is immediately recognizable as almost equal to the chi-squared statistic for a test of uniformity: the expected frequencies ($E_i$) are $f/n$ while the observed frequencies ($O_i$) are $f_i$. This statistic, by definition, is the sum of standardized differences,

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \sum_i \frac{(f_i - f/n)^2}{f/n}.$$

Let us therefore divide $(1)$ by $f/n$ to introduce $\chi^2$:

$$n f ||p||_2^2 = \sum_i \frac{(f_i - f/n)^2}{f/n} + f = \chi^2 + f.$$

Finally we may isolate a statistically meaningful expression:

$$f\left(n||p||_2^2 - 1\right) = \chi^2.$$

This shows that up to an affine transformation determined by the number of categories $n$ and total number of observations $f$, the square of the $L_2$ norm of the relative frequency vector is a standard statistic used to measure uniformity of a frequency distribution. That's why the $L_2$ norm may be of value.
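As a sanity check of the whole derivation, here is a short sketch comparing $f\left(n||p||_2^2 - 1\right)$ against the statistic returned by `scipy.stats.chisquare`, which by default tests against equal expected frequencies, i.e. uniformity; the counts are again hypothetical:

```python
import numpy as np
from scipy.stats import chisquare

counts = np.array([12, 9, 15, 4])  # hypothetical observed frequencies f_i
n = len(counts)                    # number of categories
f_total = counts.sum()             # total number of observations f

p = counts / f_total
from_l2 = f_total * (n * np.sum(p ** 2) - 1)  # f * (n * ||p||_2^2 - 1)

# chisquare defaults to equal expected frequencies f/n, i.e. a uniformity test
chi2_stat, p_value = chisquare(counts)

assert np.isclose(from_l2, chi2_stat)  # both equal 6.6 for these counts
```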

But why not just use the $\chi^2$ statistic in the first place?
