Solved – Does GLM analysis require normally distributed data and homogeneity of variance?

generalized-linear-model, heteroscedasticity, normal-distribution, r

My experiment looks at the effect of 6 different treatments on various (usually 18) physical and chemical properties of the test material (silage), at several different time points. I do my statistical analysis in R. I had been using ANOVA (aov()), with Tukey's test (HSD.test()) for post hoc analysis, but was often finding that my data were not normally distributed (using shapiro.test()) and/or had unequal variance (leveneTest() from the car package). I therefore started testing all my data, regardless of normality, using the Kruskal-Wallis test (kruskal() from the agricolae package, which also gives a post hoc analysis…).
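For concreteness, the workflow I describe above looks roughly like this (the data frame and column names here are just placeholders for my actual data):

```r
library(car)        # leveneTest()
library(agricolae)  # HSD.test(), kruskal()

# 'silage' is a data frame with a 'treatment' factor and a measured
# property such as 'pH' -- placeholder names, not my real columns
fit <- aov(pH ~ treatment, data = silage)
summary(fit)

shapiro.test(silage$pH)                     # normality check
leveneTest(pH ~ treatment, data = silage)   # homogeneity of variance

# when the checks fail, I fall back to Kruskal-Wallis with
# post hoc letter groupings
kw <- kruskal(silage$pH, silage$treatment, group = TRUE)
kw$groups
```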

I usually present my data as mean values, with the standard deviation in parentheses. I also include the result of the post hoc test as a letter to indicate statistically significant differences within a column.

Other researchers, such as Arriola et al 2011, present and analyse their data differently. They talk about using GLM in SAS (not R) as follows:

“The data were analyzed as a completely randomized design using the GLM procedure of SAS (v. 9.2 SAS Institute Inc., Cary, NC). The general model was Yij = μ + Ti + eij, where Yij = response variable, μ = overall mean, T = effect of treatment i, and eij = error term. The F-protected least significant difference test was used to compare least squares means and significance was declared at P < 0.05.”

I am unclear whether the 'GLM' analysis they describe is a General Linear Model or a Generalized Linear Model (and also do not understand the difference). I also note that this description does not mention testing for normality of the data or homogeneity of variance. Is it the case that GLM, unlike ANOVA/Kruskal-Wallis, does not rely on assumptions about the normality and variance of the data? Also, they don't present standard deviations, only the means (with letters to indicate significant differences between them), along with the SEM (standard error of the mean?) and sometimes a P value.

So, my questions are as follows:

  1. Assuming I wish to replicate the analysis of Arriola et al above, ideally using R, which analysis (General Linear Model or Generalized Linear Model) should I be trying to do here?
  2. Does this GLM require normal data and homogeneity of variance? (I often find that my data are not normally distributed and cannot easily be transformed to normality).
  3. Can I look for interactions (e.g. treatment vs time) using GLM, similar to that done using ANOVA?
  4. Does a GLM provide a post hoc type analysis to demonstrate which means are significantly different; something like the Tukey's HSD used with ANOVA? If not, can a post hoc analysis be applied to the result of the GLM?
  5. Can I do all of the above in R, allowing me to report mean values, along with SEM and a P value, as shown by Arriola et al?

Best Answer

  1. From the quote you present about the model used, what little I know about SAS, and a quick look at the paper, it doesn't seem that those authors used the SAS PROC GLM to do anything beyond what you would get from a standard analysis of variance (ANOVA). Although SAS PROC GLM can handle many types of experimental designs, even true multivariate analysis of multiple outcome variables as you have in silage analysis, the "general model" they display seems simply to be a set of standard ANOVAs comparing treatments one analyte at a time.
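If you simply want to reproduce that kind of per-analyte analysis in R (your questions 1 and 5), an ordinary one-way ANOVA fit gets you there; the data frame and column names below are invented for illustration:

```r
library(agricolae)  # LSD.test() implements the LSD means comparison

# one analyte at a time, completely randomized design: Yij = mu + Ti + eij
# 'silage' and 'dry_matter' are placeholder names
fit <- aov(dry_matter ~ treatment, data = silage)
summary(fit)   # overall F test and p-value

# F-protected LSD: only interpret the groupings if the overall F is significant
lsd <- LSD.test(fit, "treatment")
lsd$groups     # treatment means with letters marking significant differences
```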

  2. As their "general model" seems to be a set of standard ANOVAs despite their use of PROC GLM, the standard requirements for statistical significance testing apply. You probably know that you don't need "normal data" to meet these requirements, just residuals close enough to normal/homoscedastic, but I point that out for others who might look at this page. Many of the technical analyses lead to results based on fractions/percentages (e.g. percentage of total weight that is dry matter, DM; percent of DM accounted for by different types of fiber, etc). Such data can pose difficulties in meeting the requirements with respect to residuals, as the values are restricted in range. As no values of this type seem to be exactly 0% or exactly 100%, working with a logit transform of the data in fractional form (that is, start with fractions in the range 0 to 1 rather than percentages, then take the logit) or beta regression might work better.
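As a sketch of those two options for proportion-type analytes (placeholder names again, and assuming no values are exactly 0 or 1):

```r
library(betareg)  # beta regression for proportions strictly between 0 and 1

# e.g. dry matter recorded as a percentage, converted to a fraction
silage$dm_frac <- silage$dm_percent / 100   # placeholder column names

# option 1: logit-transform, then an ordinary ANOVA on the transformed scale
silage$dm_logit <- log(silage$dm_frac / (1 - silage$dm_frac))
fit_logit <- aov(dm_logit ~ treatment, data = silage)
summary(fit_logit)

# option 2: beta regression, which models the restricted range directly
fit_beta <- betareg(dm_frac ~ treatment, data = silage)
summary(fit_beta)
```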

3-4. Based on the above, you don't need to go down the route toward SAS PROC GLM. A standard linear model in R covers both points: interactions can be specified directly in the model formula, and post hoc comparisons are available via TukeyHSD() or the emmeans package.
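For your questions 3 and 4 concretely, a minimal sketch (factor and column names are invented):

```r
# two-way ANOVA with a treatment x time interaction (question 3)
# 'silage', 'pH', 'treatment', and 'time' are placeholder names
fit2 <- aov(pH ~ treatment * time, data = silage)
summary(fit2)   # includes a row testing the treatment:time interaction

# Tukey's HSD post hoc on the fitted model (question 4)
TukeyHSD(fit2, which = "treatment")
```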

  5. The reported values for SEM and p seem to be based on the ANOVA within-cell mean square and the number of observations per cell. (It's not immediately clear how they dealt with their duplicate technical measurements on each of the 4 replicates of each treatment. In principle they could have had separate estimates for technical error and among-batch error, but that's not consistent with how they presented their "general model.")
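If you want to report values in that style yourself, the pooled SEM can be recovered from the ANOVA table; this sketch assumes a balanced design with the same number of replicates per treatment (names are placeholders):

```r
fit <- aov(dry_matter ~ treatment, data = silage)  # placeholder names
anova_tab <- anova(fit)

mse <- anova_tab["Residuals", "Mean Sq"]   # within-cell mean square
n   <- min(table(silage$treatment))        # replicates per treatment (balanced design)
sem <- sqrt(mse / n)                       # pooled standard error of a treatment mean

p_value <- anova_tab["treatment", "Pr(>F)"]
means   <- tapply(silage$dry_matter, silage$treatment, mean)
```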