1) The null hypothesis is that the data is distributed according to the theoretical distribution.

2) Let $N$ be your sample size, $D$ be the observed value of the Kolmogorov-Smirnov test statistic, and define $\lambda = D(0.12 + \sqrt{N} + 0.11 / \sqrt{N})$. Then the p-value for the test statistic is approximately:

$Q = 2 \sum_{j=1}^{\infty}(-1)^{j-1}\exp\{-2j^2\lambda^2\}$

Obviously you can't calculate the infinite sum, but the terms shrink like $\exp\{-2j^2\lambda^2\}$, so summing the first 100 terms or so gets you extremely close unless $\lambda$ is tiny. This approximation is quite good even for small values of $N$, as low as 5 if I recall correctly, and gets better as $N$ increases. Note, however, that @whuber in comments proposes a better approach.
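The truncated sum is easy to sketch in code. What follows is a minimal Python illustration rather than a reference implementation: the names `kolmogorov_q` and `ks_pvalue`, the cap of 100 terms, and the tolerance are all choices made here for illustration. It stops once the terms are negligible, and returns 1 if the series has not settled within the cap (that only happens when $\lambda$ is very small, where the p-value is essentially 1 anyway).

```python
import math


def kolmogorov_q(lam, max_terms=100, tol=1e-12):
    """Truncated series Q = 2 * sum_{j>=1} (-1)^(j-1) * exp(-2 j^2 lam^2)."""
    total = 0.0
    sign = 1.0
    for j in range(1, max_terms + 1):
        term = math.exp(-2.0 * j * j * lam * lam)
        total += sign * term
        if term < tol:
            # Remaining terms are negligible; clamp against rounding error.
            return min(1.0, max(0.0, 2.0 * total))
        sign = -sign
    # Assumption: if the terms never became small, lam is tiny and Q is ~1.
    return 1.0


def ks_pvalue(d, n):
    """Approximate p-value for an observed KS statistic d at sample size n."""
    root_n = math.sqrt(n)
    lam = d * (root_n + 0.12 + 0.11 / root_n)
    return kolmogorov_q(lam)


if __name__ == "__main__":
    # Hypothetical inputs, purely for illustration.
    print(ks_pvalue(0.10, 100))
```

In practice only a handful of terms contribute for moderate $\lambda$, so the early exit is usually reached long before the cap.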

This is a perfectly reasonable alternative to the Shapiro-Wilk test I suggested in answer to your other question, by the way. Shapiro-Wilk is more powerful, but if your sample size is in the high hundreds, the Kolmogorov-Smirnov test will have quite a bit of power too.

Broadly speaking (not just in goodness of fit testing, but in many other situations), you simply *can't* conclude that the null is true, because there are alternatives that are effectively indistinguishable from the null at any given sample size.

Here are two distributions, a standard normal (green solid line), and a similar-looking one (90% standard normal, and 10% standardized beta(2,2), marked with a red dashed line):

The red one is not normal. At say $n=100$, we have little chance of spotting the difference, so we can't assert that data are drawn from a normal distribution -- what if it were from a non-normal distribution like the red one instead?

Smaller fractions of standardized betas with equal but larger parameters would be much harder to see as different from a normal.
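The "little chance of spotting the difference" claim can be checked by simulation. The sketch below rests on several assumptions made here: that the red curve is a 90/10 *mixture* of a standard normal and a beta(2,2) rescaled to mean 0 and variance 1, that the test is a Kolmogorov-Smirnov test against a fully specified standard normal, and that the asymptotic series approximation to the p-value is adequate at $n=100$. All function names are hypothetical.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample):
    """Two-sided KS distance between the sample ECDF and the standard normal."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x)
        # ECDF jumps from i/n to (i+1)/n at x; check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, max_terms=100, tol=1e-12):
    """Series approximation to the KS p-value (asymptotic, with a small-n tweak)."""
    root_n = math.sqrt(n)
    lam = d * (root_n + 0.12 + 0.11 / root_n)
    total, sign = 0.0, 1.0
    for j in range(1, max_terms + 1):
        term = math.exp(-2.0 * j * j * lam * lam)
        total += sign * term
        if term < tol:
            return min(1.0, max(0.0, 2.0 * total))
        sign = -sign
    return 1.0  # series not settled: lam is tiny, p-value is ~1


def draw_mixture(rng, n, weight_beta=0.10):
    """n draws: standardized beta(2,2) with prob. weight_beta, else N(0,1)."""
    sd_beta = math.sqrt(0.05)  # beta(2,2) has mean 0.5 and variance 0.05
    out = []
    for _ in range(n):
        if rng.random() < weight_beta:
            out.append((rng.betavariate(2, 2) - 0.5) / sd_beta)
        else:
            out.append(rng.gauss(0.0, 1.0))
    return out


def rejection_rate(n=100, reps=500, alpha=0.05, seed=1):
    """Fraction of simulated mixture samples for which normality is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        d = ks_statistic(draw_mixture(rng, n))
        if ks_pvalue(d, n) < alpha:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(rejection_rate())
```

If the point above holds, the rejection rate should sit close to the nominal level `alpha`, i.e. the test behaves almost as if the null were true.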

But given that real data are almost *never* from some simple distribution, if we had a perfect oracle (or effectively infinite sample sizes), we would essentially *always* reject the hypothesis that the data were from some simple distributional form.

As George Box famously put it, "*All models are wrong, but some are useful.*"

Consider, for example, testing normality. It may be that the data actually come from something close to a normal, but will they ever be *exactly* normal? They probably never are.

Instead, the best you can hope for with that form of testing is the situation you describe. (See, for example, the post *Is normality testing essentially useless?*, but there are a number of other posts here that make related points.)

This is part of the reason I often suggest to people that the question they're actually interested in (which is often something nearer to 'are my data close enough to distribution $F$ that I can make suitable inferences on that basis?') is usually not well-answered by goodness-of-fit testing. In the case of normality, often the inferential procedures they wish to apply (t-tests, regression etc) tend to work quite well in large samples - often even when the original distribution is fairly clearly non-normal -- *just when a goodness of fit test will be very likely to reject normality*. It's little use having a procedure that is most likely to tell you that your data are non-normal just when the question doesn't matter.

Consider the image above again. The red distribution is non-normal, and with a really large sample we could reject a test of normality based on a sample from it ... but at a much smaller sample size, regressions and two sample t-tests (and many other tests besides) will behave so nicely as to make it pointless to even worry about that non-normality even a little.

Similar considerations extend not only to other distributions but, to a large extent, to hypothesis testing more generally (even a two-tailed test of $\mu=\mu_0$, for example). One might as well ask the same kind of question - *what is the point of performing such testing if we can't conclude whether or not the mean takes a particular value?*

You might be able to specify some particular forms of deviation and look at something like equivalence testing, but it's kind of tricky with goodness of fit because there are so many ways for a distribution to be close to but different from a hypothesized one, and different forms of difference can have different impacts on the analysis. If the alternative is a broader family that includes the null as a special case, equivalence testing makes more sense (testing exponential against gamma, for example) -- and indeed, the "two one-sided test" approach carries through, and that might be a way to formalize "close enough" (or it would be if the gamma model were true, but in fact would itself be virtually certain to be rejected by an ordinary goodness of fit test, *if only the sample size were sufficiently large*).
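As a sketch of how the "two one-sided test" idea could be written down for the exponential-within-gamma case: let $k$ be the gamma shape parameter (so $k=1$ is the exponential) and let $\delta_L, \delta_U > 0$ be equivalence margins chosen by the analyst. The symbols $k$, $\delta_L$ and $\delta_U$ are notation introduced here for illustration, not part of the discussion above.

$H_0: k \le 1-\delta_L \ \text{ or } \ k \ge 1+\delta_U \qquad \text{versus} \qquad H_1: 1-\delta_L < k < 1+\delta_U$

Equivalence is declared at level $\alpha$ when both one-sided tests reject at level $\alpha$, which amounts to a $100(1-2\alpha)\%$ confidence interval for $k$ lying entirely inside $(1-\delta_L,\, 1+\delta_U)$. Choosing the margins is the hard part, since they have to reflect how much departure the downstream analysis can tolerate.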

Goodness of fit testing (and often more broadly, hypothesis testing) is really only suitable for a fairly limited range of situations. The question people usually want to answer is not so precise, but somewhat more vague and harder to answer -- but as John Tukey said, "*Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.*"

Reasonable approaches to answering the more vague question may include simulation and resampling investigations to assess the sensitivity of the desired analysis to the assumption you are considering, compared to other situations that are also reasonably consistent with the available data.
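A small simulation of this kind might look as follows. It is only a sketch under assumptions chosen here: the analysis of interest is taken to be a one-sample t-test of $\mu=0$ at $n=30$, the 5% critical value for 29 degrees of freedom is hard-coded, and the three candidate distributions (normal, the 90/10 mixture discussed earlier, and a centred exponential) are illustrative stand-ins for "situations reasonably consistent with the data". All names are hypothetical.

```python
import math
import random
import statistics

# Assumption: two-sided 5% critical value of Student's t with 29 df,
# so the sample size below must stay at n = 30.
T_CRIT_29 = 2.045


def t_statistic(sample, mu0=0.0):
    """One-sample t statistic for H0: mean == mu0."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / se


def normal_draw(rng):
    return rng.gauss(0.0, 1.0)


def mixture_draw(rng):
    # 10% standardized beta(2,2) (mean 0.5, variance 0.05), else N(0,1).
    if rng.random() < 0.10:
        return (rng.betavariate(2, 2) - 0.5) / math.sqrt(0.05)
    return rng.gauss(0.0, 1.0)


def skewed_draw(rng):
    # Exponential(1) shifted to have mean 0: clearly non-normal.
    return rng.expovariate(1.0) - 1.0


def type1_error(draw, n=30, reps=2000, seed=1):
    """Simulated rejection rate of the t-test when the true mean is 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        if abs(t_statistic(sample)) > T_CRIT_29:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    for name, draw in [("normal", normal_draw),
                       ("mixture", mixture_draw),
                       ("skewed", skewed_draw)]:
        print(name, type1_error(draw))
```

Comparing the simulated rates with the nominal 5% gives a direct read on how much the normality assumption matters for this particular analysis, which is the question a goodness-of-fit test does not answer.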

(It's also part of the basis for the approach to robustness via $\varepsilon$-contamination -- essentially by looking at the impact of being within a certain distance in the Kolmogorov-Smirnov sense)

## Best Answer

Numerous posts on site address this issue, in various guises.

It can't! Let us assume the null is true for a moment. Nevertheless there's essentially always going to be population distributions that are *closer to the data* than the hypothesized (i.e. actual) distribution is.

This is no different from the problem of demonstrating that the population mean *is* some hypothesised value.

Indeed even if the p value was exactly 1 you still could not assert the null; there's an infinite number of adjacent alternatives with p value as close as you like to 1.

Data cannot demonstrate an equality null; you can sometimes discover a discrepancy large enough to place doubt on it.

You might consider whether an equivalence test might make more sense for your circumstances, but if not there's generally going to be little you can do.