1) There are two issues with the Kolmogorov-Smirnov* -
a) it assumes the distribution is completely specified, with no estimated parameters. If you estimate parameters, the KS test becomes a form of Lilliefors test (in this case, one for Poisson-ness), and you need different critical values
b) it assumes the distribution is continuous
Both impact the calculation of p-values, and both make the test less likely to reject.
*(and the Cramér-von Mises and the Anderson-Darling, and any other test that assumes a continuous, completely specified null)
Unless you don't mind a potentially highly conservative test (of unknown size), you have to adjust the calculation of the significance level for both of these issues; simulation would be called for.
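Since simulation is the natural fix, here's a minimal sketch of a parametric-bootstrap calibration of a discrete KS-type test of Poisson-ness, with the parameter re-estimated on each simulated sample (the sample `x` and the replication count are assumptions purely for illustration):

```r
# Sketch: parametric bootstrap to calibrate a KS-type test of Poisson-ness
# when lambda is estimated, adjusting for both discreteness and estimation.
set.seed(1)
x <- rpois(200, 2.5)   # hypothetical sample of counts

ks_stat <- function(z) {
  lam <- mean(z)                 # re-estimate lambda from the sample at hand
  k <- 0:max(z)
  Fhat <- ecdf(z)(k)             # empirical CDF at each support point
  F0 <- ppois(k, lam)            # fitted Poisson CDF
  max(abs(Fhat - F0))            # discrete KS-type distance
}

obs <- ks_stat(x)
lam_hat <- mean(x)
B <- 2000
# simulate the null distribution of the statistic under the fitted Poisson
null_stats <- replicate(B, ks_stat(rpois(length(x), lam_hat)))
p_value <- mean(null_stats >= obs)   # simulated p-value
```

Note that `ks_stat` re-fits lambda inside each bootstrap replicate; that is what accounts for the estimation step.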
2) On the other hand, a vanilla chi-square goodness-of-fit test is a terrible idea when testing something that's ordered, as a Poisson is. By ignoring the ordering, it's not very sensitive to the more interesting alternatives - it throws away power against directly interesting alternatives like overdispersion, instead spending it against things like 'an excess of even numbers over odd numbers'. As a result, its power against interesting alternatives is generally even lower than the vanilla KS's, but without the compensation of the much lower type I error rate.
I think this is even worse.
3) On the gripping hand, you can partition the chi-squared statistic into components that do respect the ordering, via the use of orthogonal polynomials, and drop the less interesting highest-order components. In this particular case you'd use polynomials orthogonal with respect to the Poisson p.f.
This is the approach taken in Rayner and Best's little 1989 book, Smooth Tests of Goodness of Fit (they have a newer one on smooth tests in R that might make your life easier).
Alternatively, see papers like this one:
http://www.jstor.org/discover/10.2307/1403470
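To illustrate why the low-order components are the interesting ones: the quadratic component of a smooth test of Poisson-ness is closely related to the familiar index-of-dispersion test, which targets over- and underdispersion directly. A minimal sketch (the sample is simulated purely for illustration):

```r
# Sketch: the index-of-dispersion test, which corresponds to the
# lowest-order "interesting" component of a smooth test of Poisson-ness
# (it aims power directly at over/underdispersion).
set.seed(1)
x <- rpois(150, 4)   # hypothetical sample of counts

n <- length(x)
D <- (n - 1) * var(x) / mean(x)   # dispersion statistic
# Under the Poisson null, D is approximately chi-squared with n - 1 df;
# large values indicate overdispersion.
p_value <- pchisq(D, df = n - 1, lower.tail = FALSE)
```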
4) However, depending on why you're doing it, it may be better to reconsider the whole enterprise...
The discussion in questions like these carries over to most goodness-of-fit tests, and indeed often to most tests of assumptions in general:
Is normality testing 'essentially useless'?
What tests do I use to confirm that residuals are normally distributed?
Methods of fitting discrete distributions
There are three main methods* used to fit (estimate the parameters of) discrete distributions.

Maximum likelihood (ML)

This finds the parameter values that give the best chance of producing your sample (given the other assumptions, like independence, constant parameters, etc.).

Method of moments

This finds the parameter values that make the first few population moments match your sample moments. It's often fairly easy to do, and in many cases yields fairly reasonable estimators. It's also sometimes used to supply starting values to ML routines.

Minimum chi-square

This minimizes the chi-square goodness-of-fit statistic over the discrete distribution, though sometimes with larger data sets the end categories might be combined for convenience. It often works fairly well, and it even arguably has some advantages over ML in particular situations, but generally it must be iterated to convergence, in which case most people tend to prefer ML.
The first two methods are also used for continuous distributions; the third is usually not used in that case.
These by no means comprise an exhaustive list, and it would be quite possible to estimate parameters by minimizing the KS-statistic for example – and even (if you adjust for the discreteness), to get a joint consonance region from it, if you were so inclined.
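As a hedged sketch of the moment approach for the negative binomial: match the sample mean and variance, using the parameterization var = mu + mu^2/size (the simulated sample is an assumption for illustration):

```r
# Sketch: method-of-moments estimates for the negative binomial,
# often useful as starting values for ML.
library(MASS)
set.seed(1)
x <- rnegbin(500, mu = 7, theta = 3)   # hypothetical sample

mu_hat <- mean(x)
v <- var(x)
# Solve var = mu + mu^2/size for size; if the sample variance doesn't
# exceed the mean, there's no overdispersion to fit (Poisson limit).
size_hat <- if (v > mu_hat) mu_hat^2 / (v - mu_hat) else Inf
```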
Since you're working in R, ML estimation is quite easy to achieve for the negative binomial. If your sample is in x, it's as simple as library(MASS); fitdistr(x, "negative binomial"):

```r
> library(MASS)
> x <- rnegbin(100, 7, 3)
> fitdistr(x, "negative binomial")
      size          mu
  3.6200839   6.3701156
 (0.8033929) (0.4192836)
```
Those are the parameter estimates and their (asymptotic) standard errors.
In the case of the Poisson distribution, MLE and MoM both estimate the Poisson parameter at the sample mean.
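A quick check of that fact in R (the simulated sample is an assumption for illustration):

```r
# Sketch: for the Poisson, the ML (and moment) estimate of lambda
# is simply the sample mean.
library(MASS)
set.seed(1)
x <- rpois(100, 3)   # hypothetical sample

lambda_hat <- mean(x)
fit <- fitdistr(x, "Poisson")   # closed-form ML fit
# fit$estimate agrees with the sample mean up to numerical tolerance
```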
If you'd like to see examples, you should post some actual counts. Note that your histogram has been done with bins chosen so that the 0 and 1 categories are combined and we don't have the raw counts.
As near as I can guess, your data are roughly as follows:
Count:      0&1    2    3    4    5    6   >6
Frequency:  311  197   74   15    3    1    0
But the big numbers are uncertain (it depends heavily on how accurately the low counts are represented by the pixel heights of their bars), and they could be some multiple of those numbers, such as twice those values (the raw counts affect the standard errors, so it matters whether they're about those values or twice as big).
The combining of the first two groups makes things a little awkward (it's possible to work with combined categories, but less straightforward). A lot of the information is in those first two groups, so it's best not to just let the default histogram lump them.
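If you did want to fit a Poisson to grouped counts like these, one hedged sketch is to maximize the multinomial likelihood directly, keeping the combined 0&1 cell as a single category (the frequencies below are my guessed values from the histogram, so treat them as hypothetical):

```r
# Sketch: ML fit of a Poisson to grouped counts with the 0 and 1
# categories combined (frequencies are guesses, for illustration only).
freq <- c(311, 197, 74, 15, 3, 1)   # cells: 0&1, 2, 3, 4, 5, >=6

negloglik <- function(lam) {
  p <- c(dpois(0, lam) + dpois(1, lam),          # combined 0&1 cell
         dpois(2:5, lam),                        # cells 2 through 5
         ppois(5, lam, lower.tail = FALSE))      # ">=6" tail cell
  -sum(freq * log(p))                            # multinomial log-likelihood
}

fit <- optimize(negloglik, interval = c(0.1, 10))
lambda_hat <- fit$minimum
```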
* Other methods of fitting discrete distributions are possible of course (one might match quantiles or minimise other goodness of fit statistics for example). The ones I mention appear to be the most common.
Best Answer
The Kolmogorov-Smirnov test is designed for situations where a continuous distribution is fully specified under the null hypothesis.
Let's look at what happens with the null distribution of the test statistic when the null hypothesis is true.
When you estimate parameters, the estimation picks parameter values that make the fitted distribution closer to the data than the population distribution is.
Let's take a slightly simpler example: the normal.
Here I generate a sample of 100 values from a $N(50,5)$ (the black points in the ECDF) and compare to the population distribution function (in blue) and the fitted distribution function (normal with the mean and variance set to the sample mean and variance, shown in red):
This is typical. However, it is possible for the statistic to be larger on the fitted distribution, because we don't actually fit the distribution by minimizing the KS statistic; if we did estimate the parameters that way, the fitted normal would be guaranteed to have the smaller test statistic.
This "fitted is closer to the data than the population" effect is the same thing that leads to dividing by $n-1$ in the sample variance (the Bessel correction); here it makes the test statistic typically smaller than it should be.
So if you stuck with the usual tables, the type I error rate would be smaller than the one you chose (with a corresponding lowering of power); your test doesn't behave the way you want it to.
You may like to read about the Lilliefors test (on which there are many posts here). Lilliefors computed (via simulation) the distribution of the Kolmogorov-Smirnov statistic on fitted distributions in the normal case (unknown $\mu$, unknown $\sigma$, and both parameters unknown) and the exponential case (1967, 1969).
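A minimal sketch of Lilliefors' simulation idea for the normal case, comparing the simulated 5% critical value under estimated parameters with the usual fully-specified asymptotic value (sample size and replication count here are arbitrary choices):

```r
# Sketch: simulate the null distribution of the KS statistic when the
# normal mean and sd are estimated from the data (Lilliefors' setting).
set.seed(1)
n <- 100
B <- 2000

stat <- function(z) {
  # KS distance between the data and the normal fitted to the same data
  as.numeric(ks.test(z, "pnorm", mean(z), sd(z))$statistic)
}

null_stats <- replicate(B, stat(rnorm(n)))
crit_fitted <- quantile(null_stats, 0.95)   # calibrated 5% critical value
crit_standard <- 1.358 / sqrt(n)            # approximate value for a fully specified null
# crit_fitted comes out noticeably smaller than crit_standard, which is
# exactly why the unadjusted test is conservative.
```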
Once you fit a distribution, the test is no longer distribution-free.
In the case where you're fitting the degrees-of-freedom parameter, I don't think Lilliefors' approach will work for the t-distribution*; the advice to use bootstrapping may be reasonable in large samples.
* because the distribution of the test statistic will be different for different df (however, it might not vary much with df, in which case you could still have a reasonable approximate test)