Solved – Implementing chi-square in Python and testing on scipy’s poisson and norm variates

chi-squared-test · hypothesis-testing · python · scipy

I need to implement Pearson's chi-squared test to test random variates, but I get very different results for different sequence lengths, degrees of freedom, and even seeds. The sequences only rarely pass the test (for demonstration purposes the tests below use scipy's generators, not my custom ones). Is my implementation wrong, or is this the expected behaviour of the chi-squared test? It seems strange that I so often reject scipy's generators (and my custom ones too), and the effect of changing the seed surprises me as well.

Added (clarifying this for people who don't read Python):

Some examples of my implementation output:

For a normal distribution generated by scipy with mean 10 and scale 2, the Chi2 statistic is 1.48554564247, while the critical value is 1.14547622606 (significance alpha = 0.05, degrees of freedom = 5, sample size = 100000)

For a Poisson distribution generated by scipy with mean 10, the Chi2 statistic is 3.08213050263, while the critical value is 1.14547622606 (significance alpha = 0.05, degrees of freedom = 5, sample size = 100000)

And I also tried a much smaller sample size of 2000:

For a Poisson distribution generated by scipy with mean 10, the Chi2 statistic is 24.5618663076, while the critical value is 11.5913052088 (significance alpha = 0.05, degrees of freedom = 21)

For n degrees of freedom, the bins are chosen as two tail bins, (-Infinity; min observed value) and (max observed value; +Infinity), plus n-2 equal-width intervals between those two endpoints.

UPDATE: Now I think I am making the rejection decision with the wrong values. E.g., given a significance level alpha = 0.05, should I call the chi2 percent point function with arguments alpha and dof, or 1-alpha and dof? I think I was misled by the notation on the Russian Wikipedia page.
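For reference: `chi2.ppf` is the inverse CDF, so `ppf(alpha, dof)` returns the lower alpha-quantile, while the upper-tail critical value for this test is `ppf(1 - alpha, dof)`. A quick check (same alpha and dof as in the examples above) shows how different the two values are:

```python
from scipy.stats import chi2

alpha, df = 0.05, 5
# chi2.ppf is the inverse CDF: ppf(q, df) returns x with P(X <= x) = q.
# An upper-tail test therefore needs 1 - alpha, not alpha:
print(chi2.ppf(alpha, df))      # ~1.1455  - lower 5% point (the suspiciously small value above)
print(chi2.ppf(1 - alpha, df))  # ~11.0705 - upper 5% point (the correct critical value)
```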

import numpy as np
from scipy.stats import chi2
from scipy.stats import chisquare

def getBins(xmin, xmax, n_bins):
    r = np.linspace(xmin, xmax, num=n_bins + 1, endpoint=True)
    r = r + 10**(-10)  # shift edges right so each bin includes its right endpoint
    r[0] = r[0] - 2 * 10**(-10)  # shift the first edge below xmin, so xmin is not swallowed by the (-inf; xmin] bin
    return np.concatenate((np.array([float('-inf')]), r, np.array([float('inf')])))

# Calculates the probability of each bin (a, b] under the given cumulative distribution function
def piCalcDecoratorNew(bins, *args):
    def real_piCalcDecorator(cdfFunc):
        def piCalc(*args):
            piA = np.zeros(len(bins) - 1)
            # tail bins (-inf; bins[1]] and (bins[-2]; +inf)
            piA[0] = cdfFunc(bins[1], *args)
            piA[-1] = 1 - cdfFunc(bins[-2], *args)
            # interior bins (bins[i]; bins[i+1]]
            for i in range(1, len(bins) - 2):
                piA[i] = cdfFunc(bins[i + 1], *args) - cdfFunc(bins[i], *args)
            return piA
        return piCalc
    return real_piCalcDecorator

# similar to scipy's chisquare()
def chi2statistic(obs = np.array([16, 18, 16, 14, 12, 12], dtype='float'), exp = np.array([16, 16, 16, 16, 16, 8],dtype='float')):
    temp = np.square(obs-exp,dtype='float')
    with np.errstate(divide='ignore',invalid='ignore'):
        temp = temp / exp
        temp[exp == 0] = 0
    return sum(temp)
    # like return chisquare(obs,exp)

def chi2test(df, x, alpha, cdfFunc, *args):
    N = len(x)
    xmin = min(x)
    xmax = max(x)
    bins = getBins(xmin, xmax, df - 2)
    print("Bins for histogram are")
    print(bins)
    piCalc = piCalcDecoratorNew(bins, *args)(cdfFunc)
    piks = piCalc(*args)
    print("Expected probability to be in a bin")
    print(piks)
    a = piks * float(N)
    b = np.histogram(x, bins)[0]
    print("Observed probabilities for bins")
    print(b / float(N))
    print("Chi2 statistic is {0}".format(chi2statistic(b, a)))
    # Note: for an upper-tail test the critical value should be
    # chi2.ppf(1 - alpha, df); chi2.ppf(alpha, df) is the lower
    # alpha-quantile (see the UPDATE above)
    print("Critical value is {0}".format(chi2.ppf(alpha, df)))
    return (chi2statistic(b, a), chi2.ppf(alpha, df))
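As a sanity check on `chi2statistic`, `scipy.stats.chisquare` computes the same statistic and also returns a p-value, which gives an equivalent decision rule (reject when p < alpha). A minimal sketch with made-up counts (note that recent scipy versions require the observed and expected sums to agree):

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical observed counts and uniform expected counts (sums must match)
obs = np.array([16.0, 18.0, 16.0, 14.0, 12.0, 12.0])
exp = np.full(6, obs.sum() / 6)

stat, p = chisquare(obs, exp)  # df defaults to len(obs) - 1 = 5
print(stat, p)  # stat comes out to exactly 2.0 here
# Equivalent decision rule: reject H0 at level alpha when p < alpha,
# i.e. when stat > chi2.ppf(1 - alpha, df)
```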

And here are some of my tests and outputs:

# Testing on scipy's norm
from scipy.stats import norm
alpha = 0.05
test_sequence = norm.rvs(loc=10.0, scale=2.0, size=100000, random_state=42)
print(chi2test(5, test_sequence, alpha, norm.cdf, 10.0, 2.0))


Bins for histogram are 
[        -inf   1.06879227   7.03191768  12.99504309  18.9581685
          inf]
Expected probability to be in a bin
[  3.99216127e-06   6.88950087e-02   8.63972197e-01   6.71250536e-02
   3.74819718e-06]
Observed probabilities for bins
[ 0.       0.06836  0.86407  0.06757  0.     ]
Chi2 statistic is 1.48554564247
Critical value is 1.14547622606
(1.4855456424733853, 1.1454762260617695)

# Testing on scipy's poisson
from scipy.stats import poisson
alpha = 0.05
test_sequence = poisson.rvs(mu=10, size=100000, random_state=42)
print(chi2test(5, test_sequence, alpha, poisson.cdf, 10.0))


Bins for histogram are 
[            -inf  -1.00000000e-10   9.00000000e+00   1.80000000e+01
   2.70000000e+01              inf]
Expected probability to be in a bin
[  0.00000000e+00   4.57929714e-01   5.34883781e-01   7.18425107e-03
   2.25353405e-06]
Observed probabilities for bins
[ 0.       0.45658  0.53583  0.00759  0.     ]
Chi2 statistic is 3.08213050263
Critical value is 1.14547622606
(3.0821305026304886, 1.1454762260617695)


# Testing on scipy's poisson with other parameters
from scipy.stats import poisson
alpha = 0.05
test_sequence = poisson.rvs(mu=10, size=300000, random_state=42)

print(chi2test(max(test_sequence), test_sequence, alpha, poisson.cdf, 10.0))


Bins for histogram are 
[            -inf  -1.00000000e-10   1.08000000e+00   2.16000000e+00
   3.24000000e+00   4.32000000e+00   5.40000000e+00   6.48000000e+00
   7.56000000e+00   8.64000000e+00   9.72000000e+00   1.08000000e+01
   1.18800000e+01   1.29600000e+01   1.40400000e+01   1.51200000e+01
   1.62000000e+01   1.72800000e+01   1.83600000e+01   1.94400000e+01
   2.05200000e+01   2.16000000e+01   2.26800000e+01   2.37600000e+01
   2.48400000e+01   2.59200000e+01   2.70000000e+01              inf]
Expected probability to be in a bin
[  0.00000000e+00   4.99399227e-04   2.26999649e-03   7.56665496e-03
   1.89166374e-02   3.78332748e-02   6.30554580e-02   9.00792257e-02
   1.12599032e-01   1.25110036e-01   1.25110036e-01   1.13736396e-01
   9.47803301e-02   1.24985051e-01   3.47180696e-02   2.16987935e-02
   1.27639962e-02   7.09110899e-03   3.73216263e-03   1.86608131e-03
   8.88610150e-04   4.03913704e-04   1.75614654e-04   7.31727725e-05
   2.92691090e-05   1.54267384e-05   2.25353405e-06]
Observed probabilities for bins
[  0.00000000e+00   5.03333333e-04   2.24000000e-03   7.73666667e-03
   1.89433333e-02   3.75066667e-02   6.30300000e-02   9.04766667e-02
   1.13150000e-01   1.24163333e-01   1.25440000e-01   1.13123333e-01
   9.52500000e-02   1.25460000e-01   3.40800000e-02   2.17266667e-02
   1.27300000e-02   7.00333333e-03   3.82000000e-03   1.97000000e-03
   9.30000000e-04   4.46666667e-04   1.56666667e-04   6.00000000e-05
   4.33333333e-05   1.00000000e-05   0.00000000e+00]
Chi2 statistic is 20.8855527231
Critical value is 16.1513958497
(20.885552723131507, 16.151395849664102)

Best Answer

With a large sample size like 10000, the asymptotic chi-squared approximation to the distribution of the test statistic should already be quite accurate, so the rejection rate in repeated simulations should be close to alpha.

If it doesn't work out that way, then the most likely cause is a bug in the program. The numpy/scipy random number generators were verified years ago with a very similar chi-squared test and should not contain bugs. Bugs involving tiny numerical errors, or bugs in corner or extreme cases, can never be fully ruled out, but there are no bugs that would be visible to this kind of chi-squared test.
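The answer's claim can be checked directly: repeat the test many times on fresh samples and count rejections, which should land near alpha when the critical value `chi2.ppf(1 - alpha, df)` is used. A minimal sketch for the standard normal with equal-probability bins (the bin count and trial count are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
alpha, n, n_bins, n_trials = 0.05, 10000, 10, 200

# Interior bin edges at equal-probability quantiles of N(0, 1),
# so every bin has expected count n / n_bins under H0
edges = norm.ppf(np.arange(1, n_bins) / n_bins)
expected = n / n_bins
crit = chi2.ppf(1 - alpha, df=n_bins - 1)  # upper-tail critical value

rejections = 0
for _ in range(n_trials):
    x = rng.standard_normal(n)
    # searchsorted assigns each sample to a bin index 0..n_bins-1
    obs = np.bincount(np.searchsorted(edges, x), minlength=n_bins)
    stat = np.sum((obs - expected) ** 2 / expected)
    rejections += stat > crit

print(rejections / n_trials)  # should come out close to alpha = 0.05
```

Swapping `crit` for `chi2.ppf(alpha, df=n_bins - 1)` makes the rejection rate jump toward 95%, which reproduces the behaviour described in the question.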