I have difficulty understanding how the Kolmogorov-Smirnov test works. If I want to know whether my samples come from a specific distribution (for example, a Weibull distribution), I can compare my significance level to the p-value I get from scipy.stats. If the p-value is higher than my chosen alpha (5%), my samples come from that distribution; if the p-value is below 5%, they come from a different one.
In the code example below I don't understand the result. My samples come from the same distribution I test against, yet I get a p-value of 0, which would mean they come from a different distribution. That makes no sense to me.
It would be great if someone could help me out with this.
import scipy.stats as stats
import numpy as np
sampleData = stats.weibull_min.rvs(2.34, loc=0, scale=1, size=10000)
x = np.linspace(0, max(sampleData), num=10000, endpoint=True)
stats.kstest(stats.weibull_min.pdf(x, 2.34, loc=0, scale=1), sampleData)
#-> KstestResult(statistic=0.5031, pvalue=0.0)
I read that the KS test might not be great for large data sets. If someone has another idea for how I can compare two sample sets (without knowing the distribution behind them) to see how similar they are, I would appreciate it.
Best Answer
You got a couple of things wrong while reading the documentation of the Kolmogorov-Smirnov test.
First, you need to use the cumulative distribution function (CDF), not the probability density function (PDF). Second, you have to pass the CDF as a callable function, not evaluate it on an equally spaced grid of points. Passing an array of evaluated values doesn't work, because kstest then assumes you are handing it a second sample and performs a two-sample KS test against it.
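A corrected version of the call looks like the following sketch: pass `stats.weibull_min.cdf` itself as the callable and supply its (shape, loc, scale) parameters through `args`. The `random_state` seed is only there to make the sketch reproducible.

```python
import scipy.stats as stats

# Draw a sample from a Weibull distribution with shape parameter 2.34
sampleData = stats.weibull_min.rvs(2.34, loc=0, scale=1, size=10000,
                                   random_state=0)

# One-sample KS test: the CDF goes in as a callable,
# its parameters via `args`
res = stats.kstest(sampleData, stats.weibull_min.cdf, args=(2.34, 0, 1))
print(res)
```

Since the sample really was drawn from the hypothesized distribution, the statistic should be small and the p-value should typically be large, so the test fails to reject.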
@Dave is correct that in hypothesis testing we don't accept the null hypothesis; we can only reject it or fail to reject it. The point is that "fail to reject" is not the same as "accept".
On the other hand, it sounds a bit awkward to say "we have a sample of 10,000 but we simply have insufficient evidence to conclude anything". At this sample size we expect estimates to be precise (to have small variance).
Note that this situation is somewhat hypothetical. In practice we rarely know the true distribution, or that two large samples come from the same distribution as in the simulation. So in the real world, at sample sizes on the order of 10,000, it's more likely that the p-value will be small, not large.
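To illustrate this sensitivity (and to answer the question about comparing two sample sets without assuming any distribution), scipy's `stats.ks_2samp` runs the two-sample KS test directly on the raw data. A minimal sketch, with a deliberately small difference in shape parameters (2.34 vs 2.60, chosen here just for illustration) and fixed seeds for reproducibility:

```python
import scipy.stats as stats

# Two large samples from slightly different Weibull distributions
a = stats.weibull_min.rvs(2.34, size=100_000, random_state=0)
b = stats.weibull_min.rvs(2.60, size=100_000, random_state=1)

# Two-sample KS test: no assumption about the underlying distribution
res = stats.ks_2samp(a, b)
print(res)
```

Even though the two distributions are quite close, with samples this large the test detects the difference and the p-value comes out tiny, which is exactly the large-sample behavior described above.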
So do we learn anything if the sample size is large and the p-value is large? Arguably yes: with that much data the test has power to detect even small deviations, so a large p-value suggests any departure from the hypothesized distribution is small.
You can read more in the thread "Are large data sets inappropriate for hypothesis testing?".