Uniform Distribution – How to Test for Uniformity in Python

Tags: distributions, python, uniform distribution

I have recently started learning about distributions and hypothesis testing in statistics, and I am implementing them in Python. I am trying to write a class that helps test a pandas Series object (possibly a column taken from a pandas DataFrame) for uniformity.

This is what my code looks like.

from dataclasses import dataclass

import pandas as pd
from scipy import stats as scipy_stats


@dataclass
class UniformDistributionTest:
    """
    Test if the passed series is uniformly distributed using different tests.

    returns
        - True if the test passes,
        - False if the test fails,
        - None if the test cannot be performed
    """

    s: pd.Series = None
    result: dict = None
    default_significance_level: float = 0.1
    expected_min: float = None
    expected_max: float = None
    scale_factor: float = None

    def kolmogorov_smirnov_uniformity_test(self, low: float = None, high: float = None):
        low = low or self.expected_min
        high = high or self.expected_max

        # Using the parameters loc and scale, one obtains the uniform distribution on [loc, loc + scale].

        _stats, p = scipy_stats.kstest(self.s, scipy_stats.uniform(loc=low, scale=high - low).cdf)

        return p, p > self.default_significance_level

    def chi_square_uniformity_test(self):
        pass


if __name__ == '__main__':

    for i in range(3, 9):
        minimum = 10
        maximum = 100
        size = 10 ** i
        scale = maximum - minimum
        data = scipy_stats.uniform.rvs(loc=minimum, scale=scale, size=size)
        uni_test = UniformDistributionTest(s=pd.Series(data), expected_min=minimum, expected_max=maximum)

        print(10 ** i, uni_test.kolmogorov_smirnov_uniformity_test())

Unfortunately, I'm not able to get this correct for all inputs: across array sizes from 10^3 to 10^8, the test returns False about 20% of the time at a significance level of alpha = 10%.

Output of one of the runs

1000 (0.2669051357073732, True)
10000 (0.8986229977153088, True)
100000 (0.4625349246171656, True)
1000000 (0.6925252938298581, True)
10000000 (0.08914203172400792, False)
100000000 (0.6769095300387612, True)

What am I doing wrong?

Best Answer

Your code's logic looks fine.

I'm guessing you were hoping to see all the p-values come out larger than 10%? Well, it's not emphasized nearly enough in statistics courses, but under the null hypothesis (i.e. if your data is generated according to the null hypothesis, as yours is), the computed p-value is itself uniformly distributed between 0 and 1! Don't get confused here: the fact that you chose the uniform distribution to test isn't relevant to this. You could have chosen a normal, binomial, etc., and (for a continuous test statistic) the computed p-value would still be uniformly distributed between 0 and 1.
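
To see that this really isn't about the uniform distribution, here is a minimal sketch of the same experiment with a normal null; the standard normal, the sample size, and the 2000 repetitions are all arbitrary choices on my part:

normal_p_values = []
for _ in range(2000):
    # Data drawn from N(0, 1) and tested against N(0, 1), so the null is true
    data = scipy_stats.norm.rvs(loc=0, scale=1, size=10 ** 4)
    _, p = scipy_stats.kstest(data, scipy_stats.norm(loc=0, scale=1).cdf)
    normal_p_values.append(p)

# A histogram of normal_p_values will look flat as well: the p-values are again
# (approximately) uniform on [0, 1], even though the null here is normal.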

Let's check this for your uniform test specifically. I'm going to fix the sample size for simplicity, run the simulation 2000 times, and make a histogram of the resulting p-values:

p_values = []
for _ in range(2000):
    minimum = 10
    maximum = 100
    size = 10 ** 4
    scale = maximum - minimum
    data = scipy_stats.uniform.rvs(loc=minimum, scale=scale, size=size)
    uni_test = UniformDistributionTest(s=pd.Series(data), expected_min=minimum, expected_max=maximum)
    p, _ = uni_test.kolmogorov_smirnov_uniformity_test()
    p_values.append(p)
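
# One way to draw the histogram below (matplotlib, with an arbitrary bin count):
import matplotlib.pyplot as plt

plt.hist(p_values, bins=20)
plt.show()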

[Histogram of the 2000 simulated p-values]

Looks pretty uniform!
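
Equivalently, since the p-values are uniform, a test at the 10% level will reject about 10% of the time even when the null hypothesis is true, which is why you see the occasional False. Here is a quick check on the simulated p-values, using your default significance level of 0.1:

# Fraction of simulated p-values below 0.1; under the null this should land
# near 0.10, i.e. roughly one False result in every ten runs is expected.
rejection_rate = sum(p < 0.1 for p in p_values) / len(p_values)
print(rejection_rate)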

What's fun is that we can use your class to test whether these p-values themselves come from a uniform distribution:

test = UniformDistributionTest(s=pd.Series(p_values), expected_min=0, expected_max=1)

test.kolmogorov_smirnov_uniformity_test()
# (0.17885557613106573, True)

Looks like we can't reject the null hypothesis that these p-values are uniformly distributed!