Solved – What are appropriate tests for goodness of fit on glm with a small sample size

generalized-linear-model, goodness-of-fit, normality-assumption, regression, small-sample

I've thought quite a lot about large-sample inference, where the strong law of large numbers is easily invoked. In my case, however, I'm trying to infer the sign and magnitude of an outcome where the noise-to-signal ratio is quite large relative to my available sample size.

How do I properly test for goodness of fit? I possess a very small sample size (n = 16). I cannot possibly get more data as it simply doesn't exist.

I'm fitting a generalized linear model with Gaussian errors using R's glm() function, and summary(my.glm.fit) reports the following p-values:

intercept   0.34172  
slope       0.00734
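A minimal sketch of that workflow, using hypothetical stand-in data (x, y, and the coefficient values here are placeholders, not the asker's actual data):

```r
# Hypothetical stand-in data; the real data isn't shown in the question.
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)

# A Gaussian glm() is equivalent to ordinary least squares via lm().
my.glm.fit <- glm(y ~ x, family = gaussian)
summary(my.glm.fit)  # coefficient table: estimates, std. errors, p-values
```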
  • Cook's D flags one problematic data point. If anyone knows about its validity with small samples, please speak up!
  • I don't think a Durbin–Watson test statistic is appropriate for n = 16, even though I have very sound reasons to think my data has no autocorrelation. What could I use instead?
  • Normality tests for small sample sizes…is Lilliefors OK, or should I go with a Bayesian test?

Is there anything else you can point out? Is anything missing from my maybe-too-broad question? I can provide more details if that's of interest. Partial answers are also welcome.
Thanks.

Best Answer

I'm assuming here that you are pretty happy with your model, and believe that there are no serious lack-of-fit (LoF) issues. LoF shows up in the residuals and will well-and-truly mess up any and every test you might use that is based on the residuals. Your use of influence measures causes me to question the validity of that assumption.

Goodness of fit tests are fair weather friends:

  1. They have abundant power in large samples, to the point that they can detect differences you really don't care about.
  2. They don't have very much power in small sample situations.

In other words, they work best when you don't need them and don't work well when you need them most.

Cook's D and related measures (like Belsley's DFFITS) are not goodness-of-fit statistics. They are influence statistics, which try to measure how important a given point is to your result. The thing to remember about them is that although each statistic marginally has the derived distribution, they do not have that distribution jointly. Use the suggested cutoffs as rough guides rather than as commandments. What Cook's D tells you to do is check the point out and be aware of its impact on your conclusions.
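For concreteness, here is how those influence measures can be computed in R (a sketch on simulated stand-in data; lm() is used, which is equivalent to a Gaussian glm(), and the 4/n cutoff is just one common rule of thumb, not a commandment):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
fit <- lm(y ~ x)  # same fit as glm(y ~ x, family = gaussian)

cd <- cooks.distance(fit)  # Cook's D, one value per observation
dfs <- dffits(fit)         # Belsley-Kuh-Welsch DFFITS

# Treat cutoffs as rough guides: flag points, then refit without them
# and see whether the conclusions change.
flagged <- which(cd > 4 / length(cd))
if (length(flagged) > 0) {
  fit2 <- update(fit, subset = -flagged)
  print(coef(summary(fit2)))
}
```

The point of the refit is not to delete the observation for good, but to see how much your sign and magnitude estimates depend on it.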

According to Madansky (Prescriptions for Working Statisticians), the Shapiro-Wilk test is the best overall compromise test of Normality. The work of Shapiro and Wilk (and later Chen) bears this out. In the last ten years, some alternative tests (based on measures of skewness and kurtosis) have shown slightly better properties than the S-W test, but they have not hit the mainstream yet.
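In R, the Shapiro-Wilk test is available in base stats and should be run on the residuals rather than the raw response (sketch on simulated stand-in data; lillie.test() for Lilliefors lives in the add-on nortest package, if you want the comparison):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
fit <- lm(y ~ x)

# Test the residuals for Normality; shapiro.test() accepts 3 <= n <= 5000.
shapiro.test(residuals(fit))
# For Lilliefors instead (assumes the nortest package is installed):
# nortest::lillie.test(residuals(fit))
```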

There aren't any good tests of autocorrelation for tiny samples that I am aware of. I suppose D-W is as good as any.
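The D-W statistic itself is trivial to compute in base R if you just want a descriptive number (sketch on stand-in data; with n = 16 the tabulated significance bounds are unreliable, so treat values far from 2 only as a hint of serial correlation, not a formal test result):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
e <- residuals(lm(y ~ x))

# Durbin-Watson: squared successive differences over squared residuals.
# Ranges from 0 to 4; values near 2 are consistent with no AR(1) structure.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```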