Solved – What are appropriate tests for goodness of fit on glm with a small sample size

generalized-linear-model, goodness-of-fit, normality-assumption, regression, small-sample

I've thought quite a lot about large-sample inference, where the strong law of large numbers is easily invoked. In my case, however, I'm trying to infer the sign and magnitude of an outcome where the noise-to-signal ratio is quite large relative to my available sample size.

How do I properly test for goodness of fit? I possess a very small sample size (n = 16). I cannot possibly get more data as it simply doesn't exist.

I'm fitting a generalized linear model with Gaussian errors using R's glm() function, and summary(my.glm.fit) reports the following p-values:

intercept   0.34172  
slope       0.00734
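A minimal sketch of that workflow, using hypothetical stand-in data (x, y, and the coefficient values here are placeholders, not the asker's actual data):

```r
# Hypothetical stand-in data; the real data isn't shown in the question.
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)

# A Gaussian glm() is equivalent to ordinary least squares via lm().
my.glm.fit <- glm(y ~ x, family = gaussian)
summary(my.glm.fit)  # coefficient table: estimates, std. errors, p-values
```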
  • Cook's D flags one problematic data point. If anyone knows about its validity with small samples, please speak up!
  • I don't think a Durbin–Watson test statistic is appropriate for n = 16, even though I have very sound reasons to think my data has no autocorrelation. What could I use instead?
  • Normality tests for small sample sizes…is Lilliefors OK, or should I go with a Bayesian test?

Is there anything else you can point out? Is anything missing from my maybe-too-broad question? I can provide more details if that's of interest. Partial answers are also welcome.
Thanks.

Best Answer

I'm assuming here that you are pretty happy with your model, and believe that there are no serious lack-of-fit (LoF) issues. LoF shows up in the residuals and will well-and-truly mess up any and every test you might use that is based on the residuals. Your use of influence measures causes me to question the validity of that assumption.

Goodness of fit tests are fair weather friends:

  1. They have abundant power in large samples, to the point that they can detect differences you really don't care about.
  2. They don't have very much power in small sample situations.

In other words, they work best when you don't need them and don't work well when you need them most.

Cook's D and related measures (like Belsley's DFFITS) are not goodness-of-fit statistics. They are influence statistics, which try to measure how important a given point is to your result. The thing to remember about them is that although each statistic marginally has the derived distribution, they do not have that distribution jointly. Use the suggested cutoffs as rough guides rather than as commandments. What Cook's D tells you to do is check the point out and be aware of its impact on your conclusions.
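For concreteness, here is how those influence measures can be computed in R (a sketch on simulated stand-in data; lm() is used, which is equivalent to a Gaussian glm(), and the 4/n cutoff is just one common rule of thumb, not a commandment):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
fit <- lm(y ~ x)  # same fit as glm(y ~ x, family = gaussian)

cd <- cooks.distance(fit)  # Cook's D, one value per observation
dfs <- dffits(fit)         # Belsley-Kuh-Welsch DFFITS

# Treat cutoffs as rough guides: flag points, then refit without them
# and see whether the conclusions change.
flagged <- which(cd > 4 / length(cd))
if (length(flagged) > 0) {
  fit2 <- update(fit, subset = -flagged)
  print(coef(summary(fit2)))
}
```

The point of the refit is not to delete the observation for good, but to see how much your sign and magnitude estimates depend on it.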

According to Madansky (Prescriptions for Working Statisticians), the Shapiro-Wilk test is the best overall compromise test of Normality. The work of Shapiro and Wilk (and later Chen) bears this out. In the last ten years, some alternative tests (based on measures of skewness and kurtosis) have shown slightly better properties than the S-W test, but they have not hit the mainstream yet.
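In R, the Shapiro-Wilk test is available in base stats and should be run on the residuals rather than the raw response (sketch on simulated stand-in data; lillie.test() for Lilliefors lives in the add-on nortest package, if you want the comparison):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
fit <- lm(y ~ x)

# Test the residuals for Normality; shapiro.test() accepts 3 <= n <= 5000.
shapiro.test(residuals(fit))
# For Lilliefors instead (assumes the nortest package is installed):
# nortest::lillie.test(residuals(fit))
```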

There aren't any good tests of autocorrelation for tiny samples that I am aware of. I suppose D-W is as good as any.
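The D-W statistic itself is trivial to compute in base R if you just want a descriptive number (sketch on stand-in data; with n = 16 the tabulated significance bounds are unreliable, so treat values far from 2 only as a hint of serial correlation, not a formal test result):

```r
set.seed(1)
x <- 1:16
y <- 0.5 * x + rnorm(16, sd = 3)
e <- residuals(lm(y ~ x))

# Durbin-Watson: squared successive differences over squared residuals.
# Ranges from 0 to 4; values near 2 are consistent with no AR(1) structure.
dw <- sum(diff(e)^2) / sum(e^2)
dw
```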