Solved – scipy.stats.anderson_ksamp returns a negative test statistic

Tags: anderson-darling-test, p-value, python, scipy

OK, so I've been trying to run this test on the iris dataset to see if it flags the clusters within the data as samples that aren't from the same population.

from sklearn import datasets

iris = datasets.load_iris()
X=iris.data

but when I run the Anderson-Darling k-sample test, I get a negative test statistic along with a warning:

stats.anderson_ksamp(X)
(-7.5303855723035387, array([ 0.65422412, 1.29943382, 1.69811439, 2.05150559, 2.47260634]), 1.5192999959017166e-05, array([-0.29565939, -0.84674275, -0.70510477]))

Warning (from warnings module):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/morestats.py", line 1353
warnings.warn("approximate p-value will be computed by extrapolation")
UserWarning: approximate p-value will be computed by extrapolation

I added another return value, pf, that is used to calculate the p-value:

p = math.exp(np.polyval(pf, A2))

where A2 is the test statistic. Now, I know that a negative test statistic in this particular case distorts the p-value (it can even produce p-values > 1). I also tried running the test on a well-defined cluster within this dataset, samples 0-48 (X[0:49]), and still got a strongly negative test statistic.
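The extrapolated p-value can be reproduced directly from the values printed above. A minimal check, using only the statistic A2 and the extra polynomial-coefficient return value pf from the first call:

```python
import math
import numpy as np

# Values taken from the anderson_ksamp(X) output above:
# A2 is the test statistic, pf holds the extra polynomial
# coefficients added as a fourth return value.
A2 = -7.5303855723035387
pf = np.array([-0.29565939, -0.84674275, -0.70510477])

# The extrapolated p-value is exp of a quadratic in A2.
p = math.exp(np.polyval(pf, A2))
print(p)  # ~1.519e-05, matching the reported p-value
```

This confirms the warning's point: for a statistic this far outside the tabulated range, the p-value is pure polynomial extrapolation and shouldn't be taken at face value.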

stats.anderson_ksamp(X[0:49])
(-7.2161038796439101, array([ 0.6374498 , 1.31073023, 1.7353192 , 2.11769803, 2.58073305]), 0.0029686585640793673, array([-0.22692954, -0.92818677, -0.70082687]))

I was wondering if I am performing this test incorrectly, or if I should be using a different test, to check if many samples fall within the same distribution.

Thanks

Best Answer

Print out X and compare it to:

Y = [iris.data[:, 0], iris.data[:, 1], iris.data[:, 2], iris.data[:, 3]]

You'll see that X is a 150x4 array of float64 (150 rows of 4 measurements each), while Y is a 4-element list of 150-value columns.

So the two calls are testing different inputs: anderson_ksamp(X) iterates over rows, treating each of the 150 rows as its own tiny 4-value sample. If you run the AD k-sample test with Y instead of X, you get a very large statistic (205.9) against critical values of 0.49, 1.3, 1.9, 2.5, and 3.2.

That is to say, you can reject with great confidence the null hypothesis that all four columns of data come from the same distribution.
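The shape distinction above can be sketched end to end. To keep the example self-contained (no scikit-learn dependency), this uses a synthetic 150x4 array whose columns come from clearly different normal distributions as a stand-in for iris.data; the column means are arbitrary choices, and iris itself behaves the same way:

```python
import warnings
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for iris.data: 150 rows, 4 columns, each column drawn
# from a different distribution (means chosen arbitrarily).
X = np.column_stack(
    [rng.normal(loc, 0.4, size=150) for loc in (5.8, 3.0, 3.8, 1.2)]
)
print(X.shape)  # (150, 4) -- iterating this gives 150 tiny 4-value "samples"

# The intended input: one 1-D array per column, as in the answer's Y.
Y = [X[:, j] for j in range(4)]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # p-value is floored/extrapolated here
    res = stats.anderson_ksamp(Y)

print(res.statistic)         # far above every critical value
print(res.critical_values)
```

Because the columns really do come from different distributions, the statistic dwarfs all the critical values and the null hypothesis is rejected, mirroring the 205.9 result on the real iris columns.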