Solved – R and SAS produce the same test statistics but different p-values for normality tests

anderson-darling test, goodness of fit, kolmogorov-smirnov test, r, sas

I use the Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling tests to examine the normality of the residuals of an OLS regression. It puzzles me that R and SAS, despite giving the same test statistics for all three tests, produce quite different p-values. Below is a summary of my results:

Test               | Test statistic | p-value in R | p-value in SAS
Kolmogorov-Smirnov | 0.12607        | 0.3506       | 0.038
Cramer-von Mises   | 0.14958        | 0.3919       | 0.023
Anderson-Darling   | 0.80307        | 0.4782       | 0.037

The R commands that I used to generate the results are

library(goftest)

# Plug the estimated mean and sd of the residuals into each test
ks.test(model$residuals, "pnorm", mean(model$residuals), sd(model$residuals))
ad.test(model$residuals, "pnorm", mean(model$residuals), sd(model$residuals))
cvm.test(model$residuals, "pnorm", mean(model$residuals), sd(model$residuals))

The following SAS code is part of a macro that generates the results

/* Kolmogorov-Smirnov (D) p-value */
data _ar_norm2_(drop=testlab);
    set _ar_normality_(where=(testlab='D') keep=testlab pvalue);
run;

/* Cramer-von Mises (W-Sq) p-value */
data _ar_norm3_(drop=testlab);
    set _ar_normality_(where=(testlab='W-Sq') keep=testlab pvalue);
run;

/* Anderson-Darling (A-Sq) p-value */
data _ar_norm4_(drop=testlab);
    set _ar_normality_(where=(testlab='A-Sq') keep=testlab pvalue);
run;

What is the root cause of this difference?


Edit

Thank you very much Glen_b! Your answer is spot on.

library(nortest)
lillie.test(model$residuals)   # Lilliefors (Kolmogorov-Smirnov type)
ad.test(model$residuals)       # Anderson-Darling, composite hypothesis
cvm.test(model$residuals)      # Cramer-von Mises, composite hypothesis

p-value = 0.0382   (Lilliefors)
p-value = 0.03513  (Anderson-Darling)
p-value = 0.02311  (Cramer-von Mises)

Best Answer

The actual Kolmogorov-Smirnov, Anderson-Darling and Cramer-von Mises tests are for completely specified distributions. You're estimating the mean and variance of the residuals in your code, so you don't have completely specified distributions, which makes your p-values larger than they should be.
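A quick simulation (my own sketch, not part of the original exchange) shows the effect: generate samples that really are normal, plug the estimated mean and sd into ks.test, and the p-values pile up near 1 instead of being uniform, so the test rejects far less often than the nominal level.

# Sketch: ks.test with estimated parameters applied to truly normal data
set.seed(1)
pvals <- replicate(2000, {
  x <- rnorm(50)
  ks.test(x, "pnorm", mean(x), sd(x))$p.value
})
mean(pvals < 0.05)   # rejection rate well below the nominal 0.05
hist(pvals)          # p-values heavily skewed toward 1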

There's another test based on estimating the parameters and using a Kolmogorov-Smirnov-type statistic -- properly called a Lilliefors test. It's no longer distribution-free, and you need a different distribution for the test statistic depending on which distribution you start with and which parameters you estimate. Lilliefors did the normal and exponential cases. The normal case with both parameters estimated can be done in R using lillie.test in the nortest package.

For the other two tests the same comments apply (though approximate adjustments are a little simpler); the versions you're using in goftest are again for completely specified distributions.

In the same package I mentioned earlier (nortest) there are versions of the Cramer-von Mises and Anderson-Darling tests for testing normality. If you check the help on those functions, they specify that they're for the composite hypothesis of normality, which is what you need here.
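As an illustration on simulated data (my own sketch, not the original residuals), the two flavours can be run side by side: goftest's versions treat the plugged-in parameters as known, while nortest's versions test the composite hypothesis.

set.seed(2)
x <- rnorm(60, mean = 5, sd = 2)

goftest::ad.test(x, "pnorm", mean(x), sd(x))    # parameters treated as known
nortest::ad.test(x)                             # composite hypothesis of normality

goftest::cvm.test(x, "pnorm", mean(x), sd(x))   # parameters treated as known
nortest::cvm.test(x)                            # composite hypothesis of normality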

That won't necessarily make the p-values identical across SAS and R (they may not use the same approximations, for example) but if you use the corresponding tests they should be much closer.

There's an additional issue in your case -- you're testing residuals (perhaps from an AR model, but it doesn't matter for the present point). Even the versions in nortest don't account for the dependence between residuals; they're for independent, identically distributed values from a normal distribution with unspecified mean and variance. Even if the errors are normal, the residuals are not independent and usually don't have exactly identical distributions.
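One way to account for both the parameter estimation and the residual dependence at once is to simulate the null distribution of the statistic with a parametric bootstrap: generate normal errors, refit the same regression, and recompute the statistic on the new residuals. A rough sketch (my own, assuming model is an ordinary lm() fit as in the question):

X <- model.matrix(model)                       # design matrix of the fitted model
obs_stat <- nortest::ad.test(residuals(model))$statistic

sim_stat <- replicate(999, {
  y_sim <- fitted(model) + rnorm(nrow(X), sd = sigma(model))   # simulate normal errors
  res_sim <- lm.fit(X, y_sim)$residuals                        # refit and take residuals
  nortest::ad.test(res_sim)$statistic
})

mean(c(sim_stat, obs_stat) >= obs_stat)        # simulation-based p-value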

So even if you account for the estimation issue, the tests still won't be exactly right. I don't know what SAS is doing, but my guess is it's probably not accounting for this non-i.i.d. issue either.


As a general rule, if you want to test normality I wouldn't use multiple tests (pick the one that best identifies the kinds of deviations from normality you most want to pick up), and indeed I wouldn't use those particular tests (though the Anderson-Darling is often a pretty decent choice) -- I'd use the Shapiro-Wilk test or one of its relatives.
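For what it's worth, the Shapiro-Wilk test needs no extra package in R:

shapiro.test(model$residuals)   # Shapiro-Wilk, in base R (stats); valid for n up to 5000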

On the other hand, if I'm trying to assess the suitability of a normality assumption for some model, I wouldn't use a formal hypothesis test at all. The question is not "are the errors really normal?" (outside of simulated data, are they ever actually normal? I seriously doubt it), but "how much difference does it make?". That's an effect-size question, not a hypothesis-testing question.
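In that spirit, a simple graphical check such as a normal QQ plot (one option, not prescribed in the answer) shows how large the departure from normality is, rather than merely whether it is detectable:

qqnorm(residuals(model))   # residual quantiles against normal quantiles
qqline(residuals(model))   # reference line; judge the size and pattern of departures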
