Solved – Determining probability distribution for datasets with missing values

Tags: distributions, exploratory-data-analysis, fitting, missing-data, r

As part of my exploratory data analysis (EDA) prior to further analysis, I'm trying to determine the probability distributions of my pilot dataset's variables. A particular feature of this dataset is a significant share of missing values. I partially alleviated this problem by performing multiple imputation (MI), using the Amelia R package. The MI process reduced the share of missing values from 98% to 31%. In case it matters, the further analysis includes EFA, CFA, and SEM-PLS modeling.
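For context, here is a minimal sketch of what that MI step looks like with Amelia; the data frame `df`, the number of imputations, and the variable handling are placeholders, not the actual pilot setup:

```r
# Minimal sketch of multiple imputation with the Amelia package.
# 'df' is a placeholder for the pilot data frame; categorical variables
# may need to be declared via the noms/idvars arguments.
library(Amelia)

set.seed(42)
a.out <- amelia(df, m = 5)        # m = 5 imputed datasets is a common default

missmap(a.out)                    # visualize remaining missingness
imp1 <- a.out$imputations[[1]]    # one completed dataset for downstream EDA
summary(imp1)
```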

I have several questions in this regard. The first, and probably main, question is: What is the correct (or optimal) approach to distribution fitting in terms of using parametric versus non-parametric methods? Another question is: Does it make sense to combine both approaches for validation? The final question is: How does the presence of missing data influence the approach to distribution fitting?

The following are some of my thoughts, based on reading relevant discussions on CrossValidated. I apologize in advance if these thoughts don't display a high level of statistical rigor, as I'm not a statistician, but a software developer turned social science researcher and aspiring data scientist.

In his answer to this question, @Glen_b suggests that, given a large sample, a non-parametric approach is easier and better, or at least not worse. However, it's not clear to me whether this rule of thumb has any "contraindications", so to speak. It is also not clear what the consensus is, if any, regarding the usefulness of an automatic or semi-automatic distribution-fitting process.

In this great discussion, @Glen_b demonstrates investigating a real data distribution by applying some transformations. In this regard, if the distribution is not multimodal, but just heavily skewed, it's not clear whether it makes sense to determine the data distribution versus simply transforming the data to conform to a normal distribution, using a Box-Cox transformation.
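For concreteness, a minimal sketch of a Box-Cox transformation for a heavily skewed, strictly positive variable (the variable `x` here is simulated toy data, not the pilot dataset):

```r
# Box-Cox sketch: estimate lambda for an intercept-only model, then transform.
library(MASS)

set.seed(1)
x <- rlnorm(500, meanlog = 0, sdlog = 0.8)     # toy skewed, positive data

bc <- boxcox(x ~ 1, lambda = seq(-2, 2, 0.1))  # profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]

# Apply the transformation (log when lambda is essentially 0)
x_bc <- if (abs(lambda) < 1e-2) log(x) else (x^lambda - 1) / lambda
qqnorm(x_bc); qqline(x_bc)                     # check how close to normal it looks
```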

In this discussion, @jpillow recommends, along with Q-Q plots, the Kolmogorov-Smirnov test. However, in his paper "Fitting distributions with R", Vito Ricci states (p. 19): "Kolmogorov-Smirnov test is more powerful than chi-square test when sample size is not too great. For large size sample both the tests have the same power. The most serious limitation of Kolmogorov-Smirnov test is that the distribution must be fully specified, that is, location, scale, and shape parameters can’t be estimated from the data sample. Due to this limitation, many analysts prefer to use the Anderson-Darling goodness-of-fit test. However, the Anderson-Darling test is only available for a few specific distributions." Then there are the Shapiro-Wilk and Lilliefors tests, as well as the above-mentioned chi-square test, which can be applied to non-continuous distributions. Again, I'm rather confused about the decision-making process for selecting the tests I should use.
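The caveat Ricci raises can be illustrated in base R on toy data: plugging parameters estimated from the sample into `ks.test()` invalidates its null distribution, which is exactly what the Lilliefors correction addresses.

```r
# Toy illustration of the Kolmogorov-Smirnov caveat.
set.seed(7)
x <- rnorm(200, mean = 5, sd = 2)

# Fully specified null: a valid use of Kolmogorov-Smirnov
ks.test(x, "pnorm", mean = 5, sd = 2)

# Parameters estimated from the same sample: the problematic case
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Alternatives when parameters are estimated:
shapiro.test(x)            # Shapiro-Wilk, base R
# nortest::lillie.test(x)  # Lilliefors; nortest::ad.test(x) for Anderson-Darling
```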

In terms of distribution fitting (DF), I have discovered several R packages in addition to the ones mentioned in Ricci's paper and elsewhere, such as 'fitdistrplus' (http://cran.r-project.org/web/packages/fitdistrplus) for parametric and non-parametric DF and 'kerdiest' (http://cran.r-project.org/web/packages/kerdiest) for non-parametric DF. This is an FYI for people who haven't heard about them and are curious. Sorry about the long question, and thank you in advance for your attention!
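For readers curious about 'fitdistrplus', a short sketch of a typical parametric DF workflow on toy data (the candidate distributions here are examples only):

```r
# Parametric distribution fitting with fitdistrplus on a toy skewed variable.
library(fitdistrplus)

set.seed(123)
x <- rgamma(300, shape = 2, rate = 0.5)

descdist(x, boot = 500)               # Cullen-Frey plot to suggest candidate families

fit_gamma <- fitdist(x, "gamma")
fit_lnorm <- fitdist(x, "lnorm")

gofstat(list(fit_gamma, fit_lnorm))   # AIC/BIC and goodness-of-fit statistics
plot(fit_gamma)                       # density, CDF, Q-Q, and P-P diagnostics
```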

Best Answer

What is the correct (or optimal) approach to distribution fitting in terms of using parametric versus non-parametric methods?

There won't be one correct approach, and what might be suitable depends on what you want to "optimize" and what you're trying to achieve with your analysis.

When there is little data, you don't have much ability to estimate distributions.

There is one interesting possibility that sort of sits between the two. It's effectively parametric (at least when you fix the dimension of the parameter vector), but in a sense the approach spans the space between a simple parametric model and a model with arbitrarily many parameters.

That is, to take some base distributional model and build an extended family of distributions based on orthogonal polynomials with respect to the base distribution as the weight function. This approach has been investigated by Rayner and Best - and a number of other authors - in a number of contexts and for a variety of base distributions. This includes "smooth" goodness of fit tests, but also similar approaches for analysis of count data (which allow decomposing into "linear", "quadratic" etc. components that deviate from some null model), and a number of other such ideas.

So, for example, one would take a family of distributions based around the normal distribution and Hermite polynomials, or around the uniform and Legendre polynomials, and so on.

This is especially useful where a particular model is expected to be close to suitable, but where the actual distribution will tend to deviate "smoothly" from the base model.

In the normal and uniform cases the methods are very simple, often more easily interpretable than other flexible methods, and often quite powerful.
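A minimal sketch (not from the answer above) of the normal/Hermite case: the order-3 and order-4 orthonormal Hermite components of standardized data reduce to the familiar skewness and excess-kurtosis terms, and their sum of squares is essentially the Jarque-Bera statistic.

```r
# Smooth-test components for the normal/Hermite case.
# z are standardized data; He_3 and He_4 are probabilists' Hermite polynomials,
# scaled so each component is approximately N(0, 1) under normality.
smooth_components <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  he3 <- z^3 - 3 * z                 # He_3, norm sqrt(3!) = sqrt(6)
  he4 <- z^4 - 6 * z^2 + 3           # He_4, norm sqrt(4!) = sqrt(24)
  u3 <- sum(he3) / sqrt(6 * n)       # skewness-type component
  u4 <- sum(he4) / sqrt(24 * n)      # kurtosis-type component
  stat <- u3^2 + u4^2                # approx. chi-squared with 2 df (Jarque-Bera-like)
  c(U3 = u3, U4 = u4, statistic = stat,
    p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(11)
smooth_components(rnorm(500))   # consistent with normality
smooth_components(rexp(500))    # skewed alternative: large order-3 component
```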

Does it make sense to combine both approaches for validation?

It would often make sense to use a nonparametric approach to check a parametric one.

The other way around may make sense in some particular circumstances.
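One simple way to do the first of these, sketched on toy data: overlay a kernel density estimate (nonparametric) on the fitted parametric density and see whether they track each other. The variable `y` and the normal fit below are placeholders.

```r
# Nonparametric check of a parametric fit: KDE over the fitted density.
set.seed(99)
y <- rgamma(400, shape = 3, rate = 1)   # toy data standing in for a real variable

mu <- mean(y); sigma <- sd(y)           # parametric fit (normal, purely illustrative)

hist(y, breaks = 30, freq = FALSE, main = "Parametric fit vs. KDE")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2, lty = 2)
lines(density(y), lwd = 2)              # KDE should track the dashed parametric curve
```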
