ANOVA Nonparametric Assumptions – Is There an Assumption-Free ANOVA?

anova, assumptions, nonparametric

ANOVA presupposes normally distributed data and equal variances across groups. The Kruskal–Wallis test (a non-parametric alternative to ANOVA) assumes that all population distributions have the same form, differing only in their parameters.

I'd like to know if there is an assumption-free test: an ANOVA test which just assumes a continuous distribution and independent and identically distributed data.

Best Answer

If you are performing inference on the mean and would like to compare groups (even while adjusting for covariates), you can use a semi-parametric generalized estimating equation (GEE) model. The mean is still fitted by least squares, as in ANOVA, but the variance is modeled independently from the mean. You can also include a non-linear link function between the mean and the linear predictor. For robust inference you can use asymptotic Wald tests and confidence intervals based on the empirical sandwich covariance estimator. All of this allows for inference on means without needing to specify the underlying data distribution. You can fit such a semi-parametric model using a generalized linear model package like glm in R or Proc Genmod in SAS.
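As a rough illustration of the idea above, here is a minimal NumPy sketch of least-squares group means with an empirical sandwich covariance and a Wald test of equal means. It is not the Proc Genmod implementation: the function names, the simulated data, and the plain (uncorrected, "HC0"-style) form of the sandwich estimator are all assumptions made for the example.

```python
import numpy as np

def sandwich_group_means(y, group):
    """Least-squares group means with an empirical sandwich covariance."""
    y = np.asarray(y, dtype=float)
    labels, idx = np.unique(group, return_inverse=True)
    X = np.eye(len(labels))[idx]            # one indicator column per group
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y                  # estimated group means
    resid = y - X @ beta
    meat = X.T @ (X * resid[:, None] ** 2)  # sum of x_i x_i' e_i^2
    cov = bread @ meat @ bread              # no common-variance assumption
    return labels, beta, cov

def wald_equal_means(beta, cov):
    """Wald chi-square statistic and degrees of freedom for H0: all means equal."""
    k = len(beta)
    C = np.hstack([-np.ones((k - 1, 1)), np.eye(k - 1)])  # contrasts vs. first group
    d = C @ beta
    stat = float(d @ np.linalg.solve(C @ cov @ C.T, d))
    return stat, k - 1

# Hypothetical skewed data: three groups with unequal spread.
rng = np.random.default_rng(0)
y = np.concatenate([rng.exponential(1.0, 60),
                    rng.exponential(1.0, 80),
                    rng.exponential(2.0, 70)])
group = np.repeat(["a", "b", "c"], [60, 80, 70])

labels, beta, cov = sandwich_group_means(y, group)
se = np.sqrt(np.diag(cov))
z = 1.959964                                # 97.5% standard normal quantile
for name, b, s in zip(labels, beta, se):
    print(name, round(b, 3), (round(b - z * s, 3), round(b + z * s, 3)))

stat, df = wald_equal_means(beta, cov)
print("Wald chi-square:", round(stat, 3), "on", df, "df")
```

The statistic would be compared with a chi-square distribution on `df` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, df)`. Each group's standard error here uses only that group's residuals, which is where this differs from the pooled variance of a classical ANOVA.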

In contrast, a typical ANOVA uses a single common variance term to calculate all of the standard errors for the model parameters, and inference is performed using t-tests under the assumption of normally distributed data. Of course, the t-test is very similar to the Wald test and is robust to distribution misspecification so long as the mean estimator is approximately normally distributed and the variance estimator is consistent.

As an example, I simulated $10,000$ Monte Carlo samples of $n=50$ observations from a $\text{Weibull}(k=1.1,\lambda=3)$ distribution to investigate the coverage probability of the $95\%$ Wald confidence interval for the mean, $\mu=\lambda\Gamma(1+1/k)$, based on least squares estimating equations and the sandwich covariance estimator. Using an identity link function, the $95\%$ Wald CI covered the true mean $93.1\%$ of the time. Using a log link function, the $95\%$ Wald CI covered it $93.6\%$ of the time. With a sample size of $n=100$ these coverage probabilities become $93.6\%$ and $94.1\%$, respectively. These results are based on SAS Proc Genmod.
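A simulation along these lines can be sketched in Python. This is not the SAS code behind the numbers above. It assumes an intercept-only model, for which the least squares estimate is the sample mean, the sandwich variance reduces to $\sum_i (x_i-\bar{x})^2/n^2$, and the log-link interval is the delta-method interval for $\log\mu$ transformed back. The function name and seed are hypothetical, and the coverage it reports will differ somewhat from the figures quoted.

```python
import math
import numpy as np

def wald_coverage(n, reps=10_000, k=1.1, lam=3.0, seed=1):
    """Monte Carlo coverage of 95% Wald CIs for a Weibull(k, lam) mean."""
    rng = np.random.default_rng(seed)
    mu = lam * math.gamma(1 + 1 / k)               # true mean
    x = lam * rng.weibull(k, size=(reps, n))       # NumPy's weibull has scale 1
    xbar = x.mean(axis=1)
    # Sandwich standard error of the sample mean (no n-1 correction).
    se = np.sqrt(((x - xbar[:, None]) ** 2).sum(axis=1)) / n
    z = 1.959964                                   # 97.5% standard normal quantile

    # Identity link: interval for mu directly.
    cover_identity = (xbar - z * se <= mu) & (mu <= xbar + z * se)

    # Log link: interval for log(mu), with standard error se / xbar.
    se_log = se / xbar
    lo, hi = np.log(xbar) - z * se_log, np.log(xbar) + z * se_log
    cover_log = (lo <= math.log(mu)) & (math.log(mu) <= hi)

    return cover_identity.mean(), cover_log.mean()

for n in (50, 100):
    c_identity, c_log = wald_coverage(n)
    print(f"n={n}: identity link {c_identity:.3f}, log link {c_log:.3f}")
```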

To address Frank Harrell's concern, I simulated $1,000$ Monte Carlo samples of $n=50,000$ from a $\text{Pareto}(x_m=1, \alpha=3)$ distribution with $E[X]=\frac{\alpha x_m}{\alpha-1}=1.5$ and $\text{Var}[X]=\frac{x_m^2\alpha}{(\alpha-1)^2(\alpha-2)}=3/4$. The largest simulated value was over $900$. The Wald intervals with an identity link and with a log link both covered $E[X]$ $95.7\%$ of the time. I also simulated $1,000$ Monte Carlo samples of $n=50,000$ from a $\text{Pareto}(x_m=1, \alpha=2)$ distribution with $E[X]=\frac{\alpha x_m}{\alpha-1}=2$ and $\text{Var}[X]=\infty$. The Monte Carlo variance of the sample mean was $15.15$ and the largest simulated value was over $6,000$. The $95\%$ confidence intervals with an identity and log link covered $93.2\%$ and $93.4\%$ of the time, respectively. Using a higher nominal confidence level such as $96\%$ or $97\%$ should bring the true coverage rate closer to $95\%$.
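The effect of raising the nominal level can be explored with a small Python sketch. Again, this is not the original SAS simulation. It covers only the identity link in an intercept-only model, and the function name, seed, and default arguments are assumptions. Results in the infinite-variance case vary noticeably from run to run.

```python
from statistics import NormalDist
import numpy as np

def pareto_coverage(alpha, n=50_000, reps=1_000, x_m=1.0,
                    levels=(0.95, 0.96, 0.97), seed=2):
    """Coverage of identity-link Wald CIs for a Pareto(x_m, alpha) mean."""
    rng = np.random.default_rng(seed)
    mu = alpha * x_m / (alpha - 1)                 # true mean, requires alpha > 1
    zs = np.array([NormalDist().inv_cdf(0.5 + lv / 2) for lv in levels])
    hits = np.zeros(len(levels))
    for _ in range(reps):                          # one sample at a time to limit memory
        # rng.pareto draws a Lomax variate; shift and scale to classical Pareto.
        x = x_m * (1.0 + rng.pareto(alpha, size=n))
        xbar = x.mean()
        se = np.sqrt(((x - xbar) ** 2).sum()) / n  # sandwich SE of the mean
        hits += np.abs(xbar - mu) <= zs * se       # covered at each nominal level?
    return dict(zip(levels, hits / reps))

if __name__ == "__main__":
    print(pareto_coverage(alpha=3.0, reps=200))    # finite variance
    print(pareto_coverage(alpha=2.0, reps=200))    # infinite variance
```

Because the intervals at the three levels are nested and computed from the same samples, the estimated coverage can only go up as the nominal level increases.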

Of course with $n=50,000$ observations one might feel comfortable fitting a parametric Pareto model. Here is a thread on ResearchGate where I describe inverting the CDF of the maximum likelihood estimator while profiling nuisance parameters to construct confidence limits and confidence curves for the shape and scale parameters of a Pareto distribution. This approach could also be used for inference on the mean.

@Frank Harrell, if there is a particular distribution you would like to suggest where $n=50,000$ is insufficient for reliable inference on the mean using semi-parametric generalized estimating equations, let me know.