Uniform Distribution – How to Test for Uniformity in Python

Tags: distributions, python, uniform distribution

I have recently started learning about distributions and hypothesis testing in statistics and implementing them in Python. I am trying to write a class that helps test the uniformity of a pandas Series object, possibly taken from a pandas DataFrame.

This is what my code looks like.

from dataclasses import dataclass

import pandas as pd
from scipy import stats as scipy_stats


@dataclass
class UniformDistributionTest:
    """
    Test if the passed series is uniformly distributed using different tests.

    returns
        - True if test pass,
        - False if test fails,
        - None if test can not be performed
    """

    s: pd.Series = None
    result: dict = None
    default_significance_level: float = 0.1
    expected_min: float = None
    expected_max: float = None
    scale_factor: float = None

    def kolmogorov_smirnov_uniformity_test(self, low: float = None, high: float = None):
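        # Fall back to the expected bounds only when no argument is given;
        # `low or ...` would also discard a legitimate bound of 0.
        low = self.expected_min if low is None else low
        high = self.expected_max if high is None else high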

        # Using the parameters loc and scale, one obtains the uniform distribution on [loc, loc + scale].

        _stats, p = scipy_stats.kstest(self.s, scipy_stats.uniform(loc=low, scale=high - low).cdf)

        return p, p > self.default_significance_level

    def chi_square_uniformity_test(self):
        pass


if __name__ == '__main__':

    for i in range(3, 9):
        minimum = 10
        maximum = 100
        size = 10 ** i
        scale = maximum - minimum
        data = scipy_stats.uniform.rvs(loc=minimum, scale=scale, size=size)
        uni_test = UniformDistributionTest(s=pd.Series(data), expected_min=minimum, expected_max=maximum)

        print(10 ** i, uni_test.kolmogorov_smirnov_uniformity_test())

Unfortunately, I'm not able to get this right for all inputs: across array sizes from 10^3 to 10^8, the test returns False about 20% of the time at the alpha = 10% significance level.

Output of one of the runs:

1000 (0.2669051357073732, True)
10000 (0.8986229977153088, True)
100000 (0.4625349246171656, True)
1000000 (0.6925252938298581, True)
10000000 (0.08914203172400792, False)
100000000 (0.6769095300387612, True)

What am I doing wrong?

Best Answer

Your code's logic looks fine.

I'm guessing that you were hoping to see all the p-values come out larger than 10%? Well, it's not made nearly explicit enough in statistics courses, but under the null hypothesis (i.e. if your data is generated according to the null hypothesis, as yours is), the computed p-value is uniformly distributed between 0 and 1! So at a 10% significance level you should expect the test to return False about 10% of the time, even though nothing is wrong with the data or the code. Don't get confused here: the fact that you chose the uniform distribution to test isn't relevant to the above fact. You could have chosen a normal, binomial, etc., and the computed p-value would still be uniformly distributed between 0 and 1 (exactly so for continuous distributions, approximately for discrete ones).

Let's compute this. I'm going to fix the sample size, for simplicity, and run the simulation 2000 times, and create a histogram of the resulting p-values:

p_values = []
for _ in range(2000):
    minimum = 10
    maximum = 100
    size = 10 ** 4
    scale = maximum - minimum
    data = scipy_stats.uniform.rvs(loc=minimum, scale=scale, size=size)
    uni_test = UniformDistributionTest(s=pd.Series(data), expected_min=minimum, expected_max=maximum)
    p, _ = uni_test.kolmogorov_smirnov_uniformity_test()

    p_values.append(p)

Looks pretty uniform!
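The histogram isn't reproduced here, so here is a rough numerical stand-in. It is a sketch under my own assumptions, not the code used above: it calls scipy.stats.kstest directly instead of going through the class, uses a smaller sample size (500) to keep it quick, and the seed and the names rng, counts and rejection_rate are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
alpha = 0.1
null_cdf = stats.uniform(loc=10, scale=90).cdf  # uniform on [10, 100]

# Repeat the KS test 2000 times on data that really is uniform on [10, 100].
p_values = np.array([
    stats.kstest(rng.uniform(10, 100, size=500), null_cdf).pvalue
    for _ in range(2000)
])

# Ten equal-width bins on [0, 1]: a text version of the histogram.
counts, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
rejection_rate = float((p_values <= alpha).mean())

print(counts)
print(rejection_rate)
```

If the p-values are uniform, each of the ten bins should hold roughly 200 of the 2000 values, and the rejected fraction should land near alpha = 0.1, which is the "False about 10% of the time" behaviour you should expect.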

What's fun is that we can use your class to test whether these p-values come from a uniform distribution:

test = UniformDistributionTest(s=pd.Series(p_values), expected_min=0, expected_max=1)

test.kolmogorov_smirnov_uniformity_test()
# (0.17885557613106573, True)

Looks like we can't reject the null hypothesis that these are uniformly distributed!
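As an aside, the chi_square_uniformity_test stub in your class could be filled in along the following lines. This is only a sketch under my own assumptions: it is written as a standalone function, not a method, the number of bins (20) is an arbitrary choice, and the parameter names are mine. It bins the data into equal-width bins and compares the counts with equal expected counts using scipy.stats.chisquare.

```python
import numpy as np
from scipy import stats


def chi_square_uniformity_test(values, low, high, bins=20, alpha=0.1):
    """Chi-square goodness-of-fit test against a uniform on [low, high]."""
    # Equal-width bins; note that values outside [low, high] are silently dropped.
    counts, _ = np.histogram(values, bins=bins, range=(low, high))
    # With f_exp omitted, chisquare assumes equal expected counts in every bin.
    _stat, p = stats.chisquare(counts)
    return float(p), bool(p > alpha)


rng = np.random.default_rng(1)  # arbitrary seed
uniform_data = rng.uniform(10, 100, size=5000)
peaked_data = 10 + 90 * rng.beta(5, 5, size=5000)  # bell-shaped on [10, 100]

p_uniform, passed_uniform = chi_square_uniformity_test(uniform_data, 10, 100)
p_peaked, passed_peaked = chi_square_uniformity_test(peaked_data, 10, 100)
print(p_uniform, passed_uniform)
print(p_peaked, passed_peaked)
```

The same caveat applies as for the KS test: on truly uniform data this will still return False about 10% of the time at alpha = 0.1. The clearly non-uniform sample should be rejected essentially always.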

Related Solutions

Hypothesis Testing – How to Test Uniformity in Multiple Dimensions?

It turns out that the question is more difficult than I thought. Still, I did my homework and after looking around, I found two methods in addition to Ripley's functions to test uniformity in several dimensions.

I made an R package called unf that implements both tests. You can download it from github at https://github.com/gui11aume/unf. A large part of it is in C so you will need to compile it on your machine with R CMD INSTALL unf. The articles on which the implementation is based are in pdf format in the package.

The first method comes from a reference mentioned by @Procrastinator (Testing multivariate uniformity and its applications, Liang et al., 2000) and allows testing uniformity on the unit hypercube only. The idea is to design discrepancy statistics that are asymptotically Gaussian by the central limit theorem. This makes it possible to compute a $\chi^2$ statistic, which is the basis of the test.

library(unf)
set.seed(123)
# Put 20 points uniformly in the 5D hypercube.
x <- matrix(runif(100), ncol=20)
liang(x) # Outputs the p-value of the test.
[1] 0.9470392

The second approach is less conventional and uses minimum spanning trees. The initial work was performed by Friedman & Rafsky in 1979 (reference in the package) to test whether two multivariate samples come from the same distribution. The image below illustrates the principle.

[Figure: uniformity]

Points from two bivariate samples are plotted in red or blue, depending on their original sample (left panel). The minimum spanning tree of the pooled sample in two dimensions is computed (middle panel). This is the tree with minimum sum of edge lengths. The tree is decomposed in subtrees where all the points have the same labels (right panel).

In the figure below, I show a case where blue dots are aggregated, which reduces the number of trees at the end of the process, as you can see on the right panel. Friedman and Rafsky computed the asymptotic distribution of the number of trees obtained in the process, which makes it possible to perform a test.

[Figure: non uniformity]
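To make the counting step concrete, here is a minimal Python sketch (not the code from the unf package, which is in R and C). It only computes the number of same-label subtrees, not the p-value. The function name count_subtrees and the toy samples are my own; the MST comes from scipy.sparse.csgraph.minimum_spanning_tree on the matrix of pairwise Euclidean distances, and the sketch assumes there are no duplicate points.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def count_subtrees(sample_a, sample_b):
    """Number of same-label subtrees left after cutting mixed-label MST edges."""
    pooled = np.vstack([sample_a, sample_b])
    labels = np.array([0] * len(sample_a) + [1] * len(sample_b))
    # Dense matrix of pairwise Euclidean distances in the pooled sample.
    distances = squareform(pdist(pooled))
    mst = minimum_spanning_tree(distances).tocoo()
    # Each MST edge joining points with different labels is cut,
    # and every cut adds one subtree.
    cut_edges = int(np.sum(labels[mst.row] != labels[mst.col]))
    return cut_edges + 1


rng = np.random.default_rng(0)  # arbitrary seed
red = rng.uniform(size=(50, 2))
blue_mixed = rng.uniform(size=(50, 2))                    # same distribution as red
blue_clustered = 0.45 + 0.1 * rng.uniform(size=(50, 2))   # aggregated in a small square

trees_mixed = count_subtrees(red, blue_mixed)
trees_clustered = count_subtrees(red, blue_clustered)
print(trees_mixed, trees_clustered)
```

When the blue points are aggregated, far fewer MST edges join a red point to a blue one, so the number of subtrees drops, which is what the test picks up on.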

The idea of turning this into a general test for uniformity of a multivariate sample was developed by Smith and Jain in 1984, and implemented by Ben Pfaff in C (reference in the package). The second sample is generated uniformly in the approximate convex hull of the first sample and the test of Friedman and Rafsky is performed on the two-sample pool.

The advantage of the method is that it tests uniformity on every convex multivariate shape and not only on the hypercube. The strong disadvantage is that the test has a random component because the second sample is generated at random. Of course, one can repeat the test and average the results to get a reproducible answer, but this is inconvenient.

Continuing the previous R session, here is how it goes.

pfaff(x) # Outputs the p-value of the test.
pfaff(x) # Most likely another p-value.

Feel free to copy/fork the code from github.

Uniform Distribution – Measure for the Uniformity of a Distribution

First, note that your terminology is inconsistent. Here I take it that you have one variable (not several) consisting of a fixed number of categories and you are concerned with how categories with zero frequency or probability (not value) are handled.

Your $H$ is evidently $\sum p_i\ \text{log}_2\ (1/p_i)$ for probabilities or proportions $p_i$. The base used for logarithms does not affect any key principle here so we can think that we are summing terms $p_i\ \text{log}\ (1/p_i) = -p_i\ \text{log}\ p_i$.

The counter-argument to your worry is that entropy does take into account categories that have zero probability; it is just that they contribute zero to the entropy, given the strong convention that $-0\ \text{log}\ 0$ is evaluated as 0. A more informal version of the same argument is that the diversity or non-uniformity of what you do have in your collection is unaffected by what you don't have. If I have 10 elephants, spelling out that I have 0 giraffes or do not have any giraffes is incidental: what I have are 10 elephants. Any other statement about 0 frequencies adds no information (literally).

The same question of how to handle zero proportions arises with any measure. An alternative to entropy is based on squaring probabilities $\sum p_i^2$ and with such measures there is the same consequence that any $p_i$ that is 0 makes no difference to the sum.

You touch on a much more general issue of what can be inferred about a distribution from a summary measure. But any single summary measure is an irreversible reduction; you can't go back to the distribution unequivocally. This is on all fours with the point made in elementary statistics that a mean or correlation can reflect quite different data.

I suspect that the main issue here is that you are seeking a way to make entropy more intuitive and that is a legitimate concern. An easy way is to talk in terms of the "numbers equivalent". Calculate $2^H$ for your examples and you recover 5 for 10,10,10,10,10 and 1 for 10,0,0,0,0, which have the interpretation as the equivalent number of (equally common) categories that are present. For other examples, the result will be a non-integer, which is reasonable. For bases 10 or $e$, use $10^H$ or $\exp(H)$ to get the numbers equivalent.
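A small sketch of that calculation, in plain Python. The function names entropy_bits and numbers_equivalent are mine, and the third example (10, 5, 0, 0, 0) is made up for illustration. Zero counts are skipped, which is the $-0\ \text{log}\ 0 = 0$ convention in code form.

```python
import math


def entropy_bits(counts):
    """Shannon entropy H in bits, computed from category counts."""
    total = sum(counts)
    # Zero counts are skipped, matching the convention that -0 log 0 = 0.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def numbers_equivalent(counts):
    """2^H: the equivalent number of equally common categories."""
    return 2 ** entropy_bits(counts)


even = numbers_equivalent([10, 10, 10, 10, 10])  # about 5
single = numbers_equivalent([10, 0, 0, 0, 0])    # 1
uneven = numbers_equivalent([10, 5, 0, 0, 0])    # a non-integer between 1 and 2
print(even, single, uneven)
```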

P.S. I try to avoid asserting that something is meaningless unless I am totally sure that it is. I have found too often that I just didn't understand the argument.

EDIT 2016: If you know that (e.g.) 4 and only 4 categories are possible in principle, but only 3 occur, then that's pertinent information. Sometimes you know this: e.g. if cards can be $\{$spades, hearts, clubs, diamonds$\}$ and only some of those kinds occur, that's something to cite.

A measure of diversity that does take zeros into consideration, and is affected by whether zeros occur, has various names (e.g. dissimilarity index) and has general form $(1/2) \sum_{i=1}^S | p_i - q_i | =: D$ (say). Here $p_i$ is the observed proportion of category $i$ and $q_i$ is the proportion in a reference distribution, e.g. equal probabilities $q_i = 1/S$. Then the minimum occurs when the observed distribution is identical to the reference distribution and then $D = 0$. The maximum occurs when one proportion $p_i$ is $1$ and the others all zero. The achievable maximum depends on the number of categories $S$, which after all is part of the information. The concrete interpretation of $D$ is the minimum proportion that would need to change categories to reproduce the reference distribution.
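A minimal sketch of $D$ in plain Python, with the equal-probability reference $q_i = 1/S$ as the default. The function name dissimilarity_index and its parameters are mine.

```python
import math


def dissimilarity_index(observed_counts, reference=None):
    """D = (1/2) * sum |p_i - q_i| over all S categories, zeros included."""
    total = sum(observed_counts)
    s = len(observed_counts)
    if reference is None:
        reference = [1 / s] * s  # equal probabilities q_i = 1/S
    return 0.5 * sum(abs(c / total - q) for c, q in zip(observed_counts, reference))


d_even = dissimilarity_index([10, 10, 10, 10, 10])  # 0: identical to the reference
d_single = dissimilarity_index([10, 0, 0, 0, 0])    # about 0.8, i.e. 1 - 1/S for S = 5
print(d_even, d_single)
```

Unlike entropy, listing extra empty categories changes the answer here: with $S = 4$, the input 10, 0, 0, 0 gives $D = 0.75$, not 0.8, which is the sense in which the achievable maximum depends on $S$.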

Another example of a reference distribution would be the national distribution of different socio-economic classes or ethnic categories. Then $D = 0$ might mean that a local or regional community is a microcosm of the national and otherwise $D$ measures departure from that in some direction.
