MATLAB: Can I use a CDF with parameters estimated from the data set in the KSTEST function in the Statistics Toolbox?


The documentation for the KSTEST function states:
"The Kolmogorov-Smirnov test requires that cdf be predetermined. It is not accurate if cdf is estimated from the data."
Why are the results of the test inaccurate when the CDF is estimated from the data?

Best Answer

In the documentation, "inaccurate" does not refer to the numerical calculation performed in the KSTEST function, but rather to the relevance of the p-value.
There are two versions of the Kolmogorov-Smirnov test. One compares sample data from two completely unknown distributions and tests whether the two distributions are significantly different. The other compares sample data from an unknown distribution to the CDF of a completely specified distribution and tests whether or not the data might have come from that distribution. The null distribution of the test statistic is known only in these two cases.
A test of the first case can be performed in MATLAB using the KSTEST2 function. The second case can be tested using the KSTEST function with a specified CDF.
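As a minimal sketch of the two supported cases, the snippet below calls KSTEST2 on two samples and KSTEST on one sample against a fully specified normal CDF. The data, sample sizes, and variable names are made up for illustration; only the function calls themselves come from the toolbox.

```matlab
rng(0);                              % fix the seed so the sketch is reproducible
x = 1.5 + sqrt(3)*randn(200,1);      % hypothetical sample
y = 2 + randn(150,1);                % second hypothetical sample

% Case 1: two samples, both underlying distributions unknown
[h2, p2] = kstest2(x, y);

% Case 2: one sample against a completely specified CDF,
% here a normal distribution with mean 1.5 and variance 3.
% The 'CDF' argument is a two-column matrix: values and CDF at those values.
[h1, p1] = kstest(x, 'CDF', [x, normcdf(x, 1.5, sqrt(3))]);
```

In both calls, h is 1 if the null hypothesis is rejected at the default 5% significance level and 0 otherwise, and p is the corresponding p-value.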
Consider this example for the latter case: "Does the data in x come from a normal distribution with mean 1.5 and variance 3?" A common misconception is that the Kolmogorov-Smirnov test can also be used to test this hypothesis: "Does the data in x come from a normal distribution with mean equal to mean(x) and variance equal to var(x)?" It cannot, because the estimated distribution N(mean(x),var(x)) has been fitted to the data and is therefore "too close" to it. The test statistic tends to be smaller than it would be against a predetermined CDF, so it no longer follows the null distribution that KSTEST assumes, and the reported p-value is too large.
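One way to see the effect is a small simulation, sketched here under assumed settings (the sample size, number of repetitions, and variable names are arbitrary choices, not from the documentation). Data are drawn from N(1.5, 3), so the null hypothesis is true in every repetition, and KSTEST is run once with the true parameters and once with parameters estimated from the same sample.

```matlab
rng(1);                         % reproducible sketch
nSim  = 2000;                   % number of simulated data sets (assumed)
n     = 50;                     % sample size per data set (assumed)
alpha = 0.05;                   % nominal significance level

rejFixed     = false(nSim,1);   % rejections with predetermined parameters
rejEstimated = false(nSim,1);   % rejections with parameters fitted to x

for k = 1:nSim
    x = 1.5 + sqrt(3)*randn(n,1);   % the null hypothesis is true

    % Correct use: CDF fixed in advance
    rejFixed(k) = kstest(x, 'CDF', [x, normcdf(x, 1.5, sqrt(3))], ...
                         'Alpha', alpha);

    % Misuse: CDF parameters estimated from the same data
    rejEstimated(k) = kstest(x, 'CDF', [x, normcdf(x, mean(x), std(x))], ...
                             'Alpha', alpha);
end

fprintf('Rejection rate, predetermined CDF: %.3f\n', mean(rejFixed));
fprintf('Rejection rate, estimated CDF:     %.3f\n', mean(rejEstimated));
```

With a predetermined CDF the rejection rate should sit near the nominal 5%. With the estimated CDF it should fall well below that, which is what "the wrong null distribution" means in practice: the test almost never rejects, even at its stated significance level.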
In the case of a normal distribution, consider using the LILLIETEST function. The Lilliefors test evaluates the hypothesis that x has a normal distribution with unspecified mean and variance, against the alternative that x does not have a normal distribution.
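A minimal usage sketch, again with made-up data, might look like this:

```matlab
rng(2);                             % reproducible sketch
x = 1.5 + sqrt(3)*randn(100,1);     % hypothetical sample

% Tests normality without specifying the mean or variance;
% lillietest estimates them from x and uses the matching null distribution.
[h, p] = lillietest(x);

fprintf('h = %d, p = %.3f\n', h, p);
```

Here h = 0 means the test does not reject normality at the default 5% level. No mean or variance is passed in, which is the key difference from calling KSTEST with normcdf(x, mean(x), std(x)).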