I have recently started learning about distributions and hypothesis testing in statistics and implementing them in Python. I am trying to write a class that helps test for uniformity of a pandas Series object, possibly coming from a pandas DataFrame object.
This is what my code looks like:
from dataclasses import dataclass

import pandas as pd
from scipy import stats as scipy_stats


@dataclass
class UniformDistributionTest:
    """
    Test if the passed series is uniformly distributed using different tests.

    Each test returns its p-value together with
    - True if the test passes,
    - False if the test fails,
    - None if the test cannot be performed
    """
    s: pd.Series = None
    result: dict = None
    default_significance_level: float = 0.1
    expected_min: float = None
    expected_max: float = None
    scale_factor: float = None

    def kolmogorov_smirnov_uniformity_test(self, low: float = None, high: float = None):
        # Compare against None explicitly so that a bound of 0 is not silently replaced.
        low = self.expected_min if low is None else low
        high = self.expected_max if high is None else high
        # Using the parameters loc and scale, one obtains the uniform distribution on [loc, loc + scale].
        _stats, p = scipy_stats.kstest(self.s, scipy_stats.uniform(loc=low, scale=high - low).cdf)
        return p, p > self.default_significance_level

    def chi_square_uniformity_test(self):
        pass


if __name__ == '__main__':
    for i in range(3, 9):
        minimum = 10
        maximum = 100
        size = 10 ** i
        scale = maximum - minimum
        data = scipy_stats.uniform.rvs(loc=minimum, scale=scale, size=size)
        uni_test = UniformDistributionTest(s=pd.Series(data), expected_min=minimum, expected_max=maximum)
        print(size, uni_test.kolmogorov_smirnov_uniformity_test())
Unfortunately, I can't get this to pass for all inputs: across array sizes from 10^3 to 10^8, the test returns False about 20% of the time at the alpha = 10% significance level.
Output of one of the runs:
1000 (0.2669051357073732, True)
10000 (0.8986229977153088, True)
100000 (0.4625349246171656, True)
1000000 (0.6925252938298581, True)
10000000 (0.08914203172400792, False)
100000000 (0.6769095300387612, True)
What am I doing wrong?
Best Answer
Your code's logic looks fine.
I'm guessing that you were hoping to see all the p-values come out larger than 10%? Well, it's not made explicit nearly enough in statistics courses, but under the null hypothesis (i.e. if your data is generated according to the null hypothesis, as yours is), the computed p-value is itself uniformly distributed between 0 and 1! That means a true null hypothesis still gets rejected at the 10% level about 10% of the time, so an occasional False is exactly what you should expect. Don't get confused here: the fact that you chose the uniform distribution to test isn't relevant to the above fact. You could have chosen a normal, binomial, etc., and the computed p-value would still be uniformly distributed between 0 and 1 (exactly so for continuous test statistics, approximately for discrete ones).
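Here is a short sketch of why the p-value is uniform under the null. It assumes the test statistic T has a continuous distribution with CDF F_0 under the null hypothesis H_0, and that large values of T count as evidence against H_0; the symbols T, F_0 and alpha are introduced here only for illustration.

```latex
p = 1 - F_0(T), \qquad F_0(T) \sim \mathrm{Uniform}(0, 1) \ \text{under } H_0
\quad\Longrightarrow\quad
\Pr(p \le \alpha \mid H_0) = \Pr\bigl(F_0(T) \ge 1 - \alpha\bigr) = \alpha
\quad \text{for every } \alpha \in [0, 1].
```

The middle step is the probability integral transform: plugging a continuous random variable into its own CDF gives a Uniform(0, 1) variable.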
Let's compute this. I'm going to fix the sample size, for simplicity, and run the simulation 2000 times, and create a histogram of the resulting p-values:
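A minimal sketch of such a simulation, not necessarily the exact code behind the original plot. The sample size of 1000, the seed, the helper name `simulate_p_values` and the text-based histogram are all assumptions made for illustration; the bounds 10 and 100 are taken from the question.

```python
import numpy as np
from scipy import stats


def simulate_p_values(n_runs=2000, sample_size=1000, low=10.0, high=100.0, seed=0):
    """Draw n_runs uniform samples and return the KS-test p-value of each one."""
    rng = np.random.default_rng(seed)
    # CDF of the null distribution: uniform on [low, high].
    null_cdf = stats.uniform(loc=low, scale=high - low).cdf
    p_values = np.empty(n_runs)
    for k in range(n_runs):
        sample = rng.uniform(low, high, size=sample_size)
        p_values[k] = stats.kstest(sample, null_cdf).pvalue
    return p_values


p_values = simulate_p_values()

# Text histogram with 10 equal-width bins on [0, 1]; one '#' per 10 p-values.
counts, edges = np.histogram(p_values, bins=10, range=(0.0, 1.0))
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}]: {'#' * (count // 10)} ({count})")

# Share of runs that a 10% significance level would reject.
print("fraction of p-values below 0.1:", np.mean(p_values < 0.1))
```

If the p-values are uniform, each of the ten bins should hold roughly 200 of the 2000 values, and the fraction below 0.1 should be close to 0.1.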
Looks pretty uniform!
What's fun is that we can use your class to test if these p-values do not come from a uniform distribution:
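A hedged sketch of that check. To keep it self-contained it calls `scipy.stats.kstest` directly through a small hypothetical helper, `ks_uniformity`, which mirrors what `kolmogorov_smirnov_uniformity_test` does in the question; the sample size of 1000 and the seed are again assumptions.

```python
import numpy as np
from scipy import stats


def ks_uniformity(values, low, high, alpha=0.1):
    """Hypothetical helper: KS test of `values` against Uniform[low, high].

    Returns the p-value and whether it exceeds the significance level alpha.
    """
    cdf = stats.uniform(loc=low, scale=high - low).cdf
    p = stats.kstest(values, cdf).pvalue
    return p, bool(p > alpha)


rng = np.random.default_rng(1)

# Step 1: 2000 p-values from samples that really are uniform on [10, 100].
p_values = np.array([
    ks_uniformity(rng.uniform(10.0, 100.0, size=1000), 10.0, 100.0)[0]
    for _ in range(2000)
])

# Step 2: test the p-values themselves against Uniform[0, 1].
meta_p, passed = ks_uniformity(p_values, 0.0, 1.0)
print(meta_p, passed)
```

Note that this second-level test is itself subject to the same effect: with a different seed it will occasionally come back False at the 10% level.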
Looks like we can't reject the null hypothesis that these p-values are uniformly distributed!