Solved – Goodness-of-Fit for continuous variables

goodness-of-fit, kolmogorov-smirnov test

What are some Goodness-of-Fit tests or indices for the case of continuous variables?

For example, I am looking at the Kolmogorov–Smirnov test.

What I don't get is how one gets the empirical CDF in the first place? What I mean is, let's say I do a regression analysis with Gaussian errors. I have the maximum-likelihood estimate of the parameters. Now I also need to do a density estimation for the empirical CDF? Aren't they the same thing? Isn't my likelihood already giving me a goodness of fit? Why do I need K–S?

Best Answer

What are some Goodness of Fit tests or indices for the continuous case?

Most goodness-of-fit tests are for the continuous case. There are, quite literally, hundreds of them. Besides the Kolmogorov-Smirnov test (for a fully specified distribution, based on the maximum difference between the ECDF and the hypothesized CDF), some commonly used ones include the Anderson-Darling test (also for a fully specified distribution and ECDF-based; a variance-weighted version of the Cramér-von Mises test) and the Shapiro-Wilk test (parameters unspecified, for testing normality only).

For example, I am looking at the Kolmogorov-Smirnov test.

Okay, but why? That is, why are you testing goodness of fit?

What I don't get is how one gets the empirical CDF in the first place?

It's simply the sample version of the cdf. The cdf is $P(X\leq x)$; the ECDF is the same thing, with 'probability' (for the random variable) replaced by 'proportion' (of the data). That is, for every value $x$ in the range, you compute the proportion of the data that is less than or equal to $x$. (ECDFs only change at data values, but are still defined between them; since the ECDF is constant from each data point until the next, you really only need to identify its value at each data point and to the left of the entire sample.)

Take a small set of numbers and try it.

Here we go, a sample of three data values:

13.2  15.8  17.5

Now, for the following $x$ values, what is the proportion of the data $\leq x$?

x = 10, 13.2-$\varepsilon$, 13.2, 13.2+$\varepsilon$, 15, 15.8, 17.5-$\varepsilon$, 19

(where $\varepsilon$ is some very small number)

Can you see how it works?

(Hint: the first five answers are 0, 0, 1/3, 1/3, 1/3 and the last one is 1; the full ECDF is plotted at the end of my answer)
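
If it helps to check the arithmetic, here is a minimal sketch in Python (not part of the original calculation; any language would do) that evaluates the ECDF of this three-point sample at exactly those $x$ values:

```python
import numpy as np

# Minimal sketch: the ECDF at x is the proportion of the sample <= x.
data = np.array([13.2, 15.8, 17.5])

def ecdf(sample, x):
    """Proportion of sample values less than or equal to x."""
    return np.mean(sample <= x)

eps = 1e-9  # a very small number standing in for epsilon
points = [("10", 10), ("13.2-eps", 13.2 - eps), ("13.2", 13.2),
          ("13.2+eps", 13.2 + eps), ("15", 15), ("15.8", 15.8),
          ("17.5-eps", 17.5 - eps), ("19", 19)]

for label, x in points:
    print(f"ECDF({label}) = {ecdf(data, x):.3f}")
```

The printed proportions (0, 0, 1/3, 1/3, 1/3, 2/3, 2/3, 1) agree with the hint above.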

What I mean is, let's say I do a regression analysis with Gaussian errors.

What prompts you to use this example? Did something (a book, say, or a website) lead you to think you ought to use a goodness of fit test in this situation?

I have the maximum-likelihood estimate of the parameters. Now I also need to do a density estimation for the empirical CDF?

Empirical cdf of what?

Note that the KS is a test, not an estimate. What hypothesis are you testing and why?

Aren't they the same thing?

No, they're quite different, as discussed below.

Isn't my likelihood already giving me a goodness of fit?

The likelihood for the regression tells you about the fit of the line; in the case below, how close the red line is to the data.

[Plot: scatter of data with the fitted regression line in red; a green strip marks the conditional distribution of $y$ near one particular $x$]

You could replace the data with another set of values with the same summary statistics but a different distribution, and the likelihood would be identical.

See the Anscombe quartet for a good example of how very different data could have the same likelihood surface.

By contrast, with a goodness-of-fit test you're checking whether the shape of some distribution, like a normal distribution with some mean and variance, fits the data (the KS test measures the discrepancy from the hypothesized distribution by looking at the ECDF, giving a test that doesn't change when you apply the same monotonic transformation to both halves of the comparison - which is what makes it nonparametric):

[Plot: the sample distribution compared with a hypothesized normal distribution]
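
To make the contrast concrete, here is a minimal sketch (made-up data; scipy assumed available) of the KS test against a fully specified normal distribution, i.e. one whose mean and standard deviation are fixed before looking at the data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 values tested against N(10, 2^2), where the mean
# and standard deviation are specified in advance, not estimated from x.
rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=50)

# The KS statistic is the largest vertical gap between the ECDF of x
# and the hypothesized normal CDF.
stat, pvalue = stats.kstest(x, "norm", args=(10, 2))
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")
```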

So how does this relate to linear regression?

Some people try to test whether the assumption of normality around the line holds (such as the distribution in the green strip in the first plot), as a check on the assumption about the error distribution:

[Plot: the distribution of the errors about the fitted line compared with a normal distribution]

  • but this check is done across all x, not just some particular x (I did show values near a particular $x$ to emphasize that it's the conditional distribution of $y$ - or equivalently, the distribution of the errors - that is relevant).

-- it's not clear from your description if that's what you mean to ask about, though.


However:

1) formally testing goodness of fit as a check on assumptions isn't necessarily suitable:

(i) it answers the wrong question (the relevant question is 'what is the impact on my inference of the degree of non-normality we have?'), and

(ii) only tells you anything when it's of almost no use to you to know it (goodness of fit tests tend to show significance in medium to large samples, where it usually doesn't matter much, and tend not to be significant in small samples where it matters most), and

(iii) changing what you do based on the outcome of the test is usually less appropriate than simply acting as if you'd reject the null in the first place (once your subsequent analysis depends on the test outcome, your regression inference no longer has the desired properties).

2) even without all that, the KS is a test for a fully specified distribution. You have to specify the mean and standard deviation for each data point before you see any data. If you're estimating the mean (say by fitting a line) and a standard deviation (say by the standard error of the residuals, s), then you simply shouldn't be using the KS test.

There are tests for the situation where you estimate the mean and variance (the equivalent of the KS test is called the Lilliefors test), but for normality the standard is the Shapiro-Wilk test (and though the simpler Shapiro-Francia test is almost as powerful, most statistical software implements the full Shapiro-Wilk test).
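
As a sketch of that distinction (made-up residuals from a simple linear fit; scipy and statsmodels assumed available), compare the plain KS test, which is not valid once the parameters are estimated, with Lilliefors and Shapiro-Wilk:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical data: simple linear regression with normal errors.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=40)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# Not valid: plain KS with mean and sd estimated from the same residuals
# (shown only for contrast; the p-value is not trustworthy here).
print(stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1))))

# More appropriate: Lilliefors (KS adjusted for estimated parameters),
# or the Shapiro-Wilk test for normality.
print(lilliefors(resid, dist="norm"))
print(stats.shapiro(resid))
```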

Why do I need KS?

Well, basically you don't.

There is almost never a circumstance in which the KS test is a good choice for the situation you describe.

My suggestion is to either use some procedure that doesn't assume normality (e.g. a robust approach, or perhaps least squares with inference based on resampling), or, if you're in a position to reasonably assume normality, to double-check the reasonableness of that assumption with a diagnostic display (like a Q-Q plot; incidentally, the Shapiro-Francia test is effectively based on the $R^2$ in that plot).
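
For the diagnostic-display route, here is a sketch (stand-in residuals; scipy and matplotlib assumed available) that draws a normal Q-Q plot and reports the plot correlation, whose square is essentially the quantity the Shapiro-Francia statistic is based on:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Stand-in for regression residuals; replace with your own residual vector.
rng = np.random.default_rng(2)
resid = rng.normal(size=40)

# probplot draws the normal Q-Q plot and also returns r, the correlation
# between the ordered data and the normal quantiles.
(osm, osr), (fit_slope, fit_intercept, r) = stats.probplot(resid, dist="norm", plot=plt)
print(f"Q-Q plot correlation r = {r:.4f}, r^2 = {r**2:.4f}")
plt.show()
```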

In large samples, normality is less important to your inference (for everything but prediction intervals), so you can tolerate larger deviations from normality (equal variance and independence assumptions matter much more).

In small samples, you're more dependent on the assumption for your tests and confidence intervals, but you simply can't be sure how severe any non-normality is. With small samples, you're better off simply working as if your data were non-normal. (There are a number of good robust options, but you should usually also consider the potential impact of influential points, not just of potential y-outliers.)
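
And for the resampling option mentioned a couple of paragraphs up, here is a rough sketch (hypothetical data) of pairs-bootstrap inference for the slope of a least-squares line, which doesn't lean on the normality assumption:

```python
import numpy as np

# Hypothetical data: linear relationship with heavy-tailed (t_3) errors.
rng = np.random.default_rng(3)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.standard_t(df=3, size=n)

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

# Pairs (case) bootstrap: resample (x, y) pairs with replacement.
boot = np.empty(5000)
for b in range(boot.size):
    idx = rng.integers(0, n, size=n)
    boot[b] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile interval
print(f"slope = {slope(x, y):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```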


ECDF for the small example data set earlier in the answer:

[Plot: ECDF of the sample 13.2, 15.8, 17.5]