Goodness-of-Fit Tests – Should Outliers Be Removed for Goodness-of-Fit Tests?

Tags: distributions, goodness-of-fit, outliers

If you'll allow a bit of digression about the context: I am on a journey to better understand the power and usefulness of parametric distributions; I am a bit scared of them. Maybe because I entered the world of data analysis more from the side of "data science" & ML rather than from the side of pure statistics, I believed that all the answers lay in the dataset at hand and that nonparametric statistics was THE best, safest answer. Reading around, I also came to understand its limitations, so I'm willing to understand for once how parametric statistics fits into the picture and how I can take advantage of it.

One thing I never fully understood is how to match those neat parametric distributions I see in stats courses with the ugly, dirty distributions of real-world data. Actually, this would be the real question for me. The way I see it, there are so many distributions to choose from that I feel I should know all of them to pick the right one, and there is a high chance of being wrong.

However, I learned that there are tests like the Kolmogorov-Smirnov test for goodness of fit, which test whether your data match a given parametric distribution. Is running a KS test against all the main distributions the ultimate solution? And if it is, what's still not clear to me is whether I should remove outliers first.
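For concreteness, here is a minimal sketch (just a made-up example with SciPy, not real data) of what I mean by running a KS test against one candidate family:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=500)   # stand-in for messy real-world data

# Fit one candidate parametric family and test the fit against it.
params = stats.gamma.fit(data)                        # shape, loc, scale estimated from the data
stat, p = stats.kstest(data, "gamma", args=params)
print(f"KS statistic = {stat:.3f}, p-value = {p:.3f}")
```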

Best Answer

It is in the nature of statistical tests that non-rejection of a null hypothesis does not mean that the null hypothesis is true. In particular, not rejecting a model assumption with the KS (or any other) test does not ensure that the model is true. If you test several models, you may well fail to reject several of them.
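To illustrate this (a toy sketch of my own, not from the answer or the paper): fit several candidate families to the same data and run a KS test against each. With moderate sample sizes it is quite possible that none of them is rejected, even though at most one can be true. (Note also that plugging in parameters estimated from the same data makes the standard KS p-values too large, so the test rejects even less often than the nominal level suggests.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.4, size=200)   # the "true" model is lognormal

for name in ["lognorm", "gamma", "weibull_min"]:
    dist = getattr(stats, name)
    params = dist.fit(data)                           # parameters estimated from the same data
    stat, p = stats.kstest(data, name, args=params)
    print(f"{name:12s}  KS statistic = {stat:.3f}  p-value = {p:.3f}")
# Typically more than one candidate is not rejected; non-rejection is not confirmation.
```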

In fact, whether model assumptions should be tested before using a model is controversial, and to what extent this is useful depends on the specific situation. We have a paper that discusses the issue in some depth: https://arxiv.org/abs/1908.02218

One reason against testing model assumptions is what I call the "misspecification paradox": conditionally on not rejecting a model assumption, the data violate that assumption, even if they followed the model before any misspecification/goodness-of-fit testing was done; see Most interesting statistical paradoxes.
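To make the paradox concrete, here is a rough simulation sketch of my own (not taken from the paper): generate datasets from a model that is exactly true, keep only those a goodness-of-fit test does not reject, and look at how the retained data differ from the nominal model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 30, 5000

all_samples, passed_samples = [], []
for _ in range(reps):
    x = rng.normal(size=n)                  # the normal model is exactly true here
    all_samples.append(x)
    if stats.shapiro(x).pvalue > 0.05:      # keep only datasets that "pass" a normality test
        passed_samples.append(x)

all_x = np.concatenate(all_samples)
passed_x = np.concatenate(passed_samples)
print("excess kurtosis, all generated data: ", stats.kurtosis(all_x))
print("excess kurtosis, data passing test:  ", stats.kurtosis(passed_x))
# The surviving data tend to have slightly lighter tails than the normal model
# implies (the effect is small but systematic): conditionally on non-rejection,
# the model assumption no longer holds exactly.
```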

On the other hand, it is a misconception that model assumptions need to be fulfilled in order to apply a model-based method. In fact, many model-based methods also work quite well in situations in which the model is violated, although this depends on what exactly you do and how the model is violated; see the paper linked above. In many situations the misspecification paradox mentioned above, even though it technically violates the model assumption, does not affect a method's performance much.

In fact, nonparametric methods are not a magic bullet, in the sense that there are situations in which a model-based method can do better than a nonparametric one even if the nominal model is violated. An example comparing the two-sample t-test with the nonparametric Wilcoxon test is also given in the paper.

This of course depends on what your aim is. If you have lots of data and prediction quality is your primary aim, a parametric method will have a very hard time beating a good nonparametric one. However, for decision making you may want to summarise the data using statistics such as the mean or regression estimators that can be interpreted, and nonparametric methods in such situations may not give you what you want. Parametric modelling may also help you think more clearly about what is going on, and more sophisticated models such as mixed-effects/multilevel or time-series models can incorporate detailed information about how the data were collected and what kind of dependence structure there is among them.
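The details of that comparison are in the paper; purely as a sketch of how one can explore such comparisons by simulation (my own toy setup, not the paper's example), one can estimate the rejection rates of both tests under a location shift with non-normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, shift, alpha = 50, 2000, 0.10, 0.05

t_rej = w_rej = 0
for _ in range(reps):
    # Light-tailed, clearly non-normal groups differing by a location shift.
    x = rng.beta(2, 2, size=n)
    y = rng.beta(2, 2, size=n) + shift
    t_rej += stats.ttest_ind(x, y, equal_var=False).pvalue < alpha
    w_rej += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha

print(f"Welch t-test rejection rate: {t_rej / reps:.3f}")
print(f"Mann-Whitney rejection rate: {w_rej / reps:.3f}")
# For light-tailed data like these, the t-test tends to be at least competitive
# with the rank-based test even though the normality assumption is violated;
# with heavier tails the picture can reverse.
```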

Regarding removing outliers: I would only remove outliers if there is strong evidence, usually from background knowledge, that these observations are indeed erroneous. Your data are information; removing outliers means removing potentially meaningful and important information. Being outlying is not, on its own, enough of a reason for removal (unless the value is actually impossible). Many (but not all) parametric methods can be badly affected by outliers, but there are so-called robust statistics that still estimate parameters, in ways that are less affected or unaffected by outliers.
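To illustrate the robust-statistics point with a toy example of my own (just for intuition): a single gross outlier can drag the mean and standard deviation around while leaving the median and MAD essentially untouched, so you can keep the observation in the data and still get stable parameter estimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
clean = rng.normal(loc=10.0, scale=2.0, size=100)
with_outlier = np.append(clean, 200.0)    # one gross, possibly erroneous value

for label, x in [("clean", clean), ("with outlier", with_outlier)]:
    mad = stats.median_abs_deviation(x, scale="normal")
    print(f"{label:13s} mean = {x.mean():7.2f}  sd = {x.std(ddof=1):6.2f}  "
          f"median = {np.median(x):6.2f}  MAD = {mad:5.2f}")
# The median and MAD barely move, while the mean and sd are dominated by the
# single outlier; M-estimators (e.g. statsmodels' RLM) extend this idea to
# regression and other parametric models.
```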