Solved – Trying to choose the right post-hoc analysis for the unbalanced dataset

post-hoc r

I ran into some problems analyzing my unbalanced data set.

The number of observations per group was determined by the survival rate of seedlings in a tree nursery, where different concentrations of fertilizer were applied. I ran an ANOVA on my stacked data (seedling height as the response, with five fertilizer concentrations as the factor).

But can I simply use ANOVA (aov) if the sample sizes are unequal? The real fun started when I tried to find out whether TukeyHSD is a suitable post-hoc test in this case. Some sources say that Tukey's test is designed for balanced data, while others claim that TukeyHSD is the best available method when confidence intervals are needed or sample sizes are unequal.

I tried running an ANOVA (in Excel) on one of the pairs and got a different p-value than from the TukeyHSD test in R. So I assume that the unequal sample sizes influenced the result in one case or the other.

Does anyone have any suggestions?

Best Answer

You need balanced data for the usual tables and hand calculations to be correct. However, if you use the R glht function in the multcomp package, its calculations are based on the multivariate $t$ distribution with the funny covariance structure you get with unequal sample sizes, so the adjusted P values are correct as long as the normality, homoscedasticity, and independence assumptions hold. The needed call would be something like

summary(glht(model, linfct = mcp(trt = "Tukey")))

You can also get these adjustments via the lsmeans package (since superseded by emmeans, whose emmeans() function takes the same arguments) and a call like

pairs(lsmeans(model, "trt"), adjust = "mvt")
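To see why a two-group test on a single pair gives a different p-value than a Tukey-adjusted comparison, here is a minimal base-R sketch. The data are simulated and purely hypothetical (five concentrations labelled `c1` to `c5` with unequal group sizes); the names `dat`, `trt`, and `height` are assumptions, not the asker's variables. Base R's `TukeyHSD` applies the Tukey-Kramer adjustment for unequal sample sizes, and the adjusted p-value for a pair is never smaller than the unadjusted one computed from the same pooled variance.

```r
# Hypothetical unbalanced data: 5 fertilizer concentrations, unequal survivors
set.seed(1)
n     <- c(12, 9, 15, 7, 11)          # seedlings surviving per concentration
means <- c(20, 22, 25, 23, 21)        # assumed true mean heights
dat <- data.frame(
  trt    = factor(rep(paste0("c", 1:5), times = n)),
  height = rnorm(sum(n), mean = rep(means, times = n), sd = 3)
)

model <- aov(height ~ trt, data = dat)

# Tukey-Kramer adjusted pairwise comparisons (handles unequal n)
tuk <- TukeyHSD(model, "trt")$trt

# Unadjusted pairwise t tests using the same pooled standard deviation
raw <- pairwise.t.test(dat$height, dat$trt,
                       p.adjust.method = "none", pool.sd = TRUE)$p.value

p_adj <- tuk["c2-c1", "p adj"]        # adjusted for all 10 comparisons
p_raw <- raw["c2", "c1"]              # no multiplicity adjustment
c(adjusted = p_adj, unadjusted = p_raw)
```

An ANOVA run on just two of the groups differs in a second way as well: it estimates the error variance from those two groups only, rather than pooling across all five, so its p-value will generally not match either number above.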

Related Solutions

Solved – Pairwise comparison of vectors with unequal sizes and unequal variances

From the sound of it, you are comparing mean levels of an outcome in 3 different groups. Linear regression will do this, and if you want robustness against different variances in the different groups, robust estimates of the standard errors can be used to take care of this.

Here's some R code that generates some example data, does the linear regression, computes robust standard errors, and performs a test that all three group means are equal:

# generate the data
set.seed(4)
y1 <- rnorm(21, mean=3, sd=3)
y2 <- rnorm(33, mean=2, sd=3.5)
y3 <- rnorm(7, mean=4, sd=2.4)
y <- c(y1, y2, y3)
group <- factor(rep(1:3, times=c(21,33,7)))

# do the regression
lm1 <- lm(y~group)

# perform the test, using robust standard errors
library("sandwich") # you may need to install these packages
library("lmtest")

waldtest(lm1, vcov=vcovHC(lm1) )

If the variance doesn't differ very much between groups, you'll probably be fine without the robust standard errors.
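One way to judge whether the variances differ much is to look at the per-group standard deviations. The sketch below does that on the same simulated data and then runs a different technique from the robust-standard-error approach above: Welch's one-way test and Welch-type pairwise t tests from base R, neither of which assumes equal variances. The Holm adjustment is an assumed choice for illustration.

```r
# Same simulated data as in the example above
set.seed(4)
y     <- c(rnorm(21, mean = 3, sd = 3),
           rnorm(33, mean = 2, sd = 3.5),
           rnorm(7,  mean = 4, sd = 2.4))
group <- factor(rep(1:3, times = c(21, 33, 7)))

# Per-group standard deviations: a quick look at heteroscedasticity
group_sd <- tapply(y, group, sd)
group_sd

# Welch's one-way test: overall test of equal means without equal variances
welch <- oneway.test(y ~ group, var.equal = FALSE)
welch

# Pairwise Welch t tests (pool.sd = FALSE), Holm-adjusted for multiplicity
pw <- pairwise.t.test(y, group, pool.sd = FALSE, p.adjust.method = "holm")
pw
```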

Solved – Does GLM analysis require normally distributed data and homogeneity of variance

1. From the quote you present about the model used, what little I know about SAS, and a quick look at the paper, it doesn't seem that those authors used the SAS PROC GLM to do anything beyond what you would get from a standard analysis of variance (ANOVA). Although SAS PROC GLM can handle many types of experimental designs, even true multivariate analysis of multiple outcome variables as you have in silage analysis, the "general model" they display seems simply to be a set of standard ANOVAs comparing treatments one analyte at a time.

2. As their "general model" seems to be a set of standard ANOVAs despite their use of PROC GLM, the standard requirements for statistical significance testing apply. You probably know that you don't need "normal data" to meet these requirements, just residuals close enough to normal/homoscedastic, but I point that out for others who might look at this page. Many of the technical analyses lead to results based on fractions/percentages (e.g. percentage of total weight that is dry matter, DM; percent of DM accounted for by different types of fiber, etc). Such data can pose difficulties in meeting the requirements with respect to residuals, as the values are restricted in range. As no values of this type seem to be exactly 0% or exactly 100%, working with a logit transform of the data in fractional form (that is, start with fractions in the range 0 to 1 rather than percentages, then take the logit) or beta regression might work better.

3-4. Based on the above, you don't need to go down the route toward SAS PROC GLM.

The reported values for SEM and p seem to be based on the ANOVA within-cell mean square and the number of observations per cell. (It's not immediately clear how they dealt with their duplicate technical measurements on each of the 4 replicates of each treatment. In principle they could have had separate estimates for technical error and among-batch error, but that's not consistent with how they presented their "general model.")
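The logit-transform suggestion above for fractional outcomes can be sketched in base R as follows. The data are simulated and hypothetical (three treatments `A`, `B`, `C` with 4 replicates each, loosely mirroring the replicate count mentioned above); `frac` and `trt` are assumed names, not variables from the paper.

```r
# Hypothetical dry-matter fractions (0 to 1, not percentages)
set.seed(2)
trt  <- factor(rep(c("A", "B", "C"), each = 4))
frac <- plogis(rnorm(12, mean = rep(c(-1.0, -0.6, -0.2), each = 4), sd = 0.15))

# The logit is undefined at exactly 0 or 1, so check before transforming
stopifnot(all(frac > 0 & frac < 1))

# Standard one-way ANOVA on the logit scale: log(frac / (1 - frac))
fit <- aov(qlogis(frac) ~ trt)
summary(fit)

# Treatment means on the logit scale, back-transformed to fractions
logit_means <- tapply(qlogis(frac), trt, mean)
back <- plogis(logit_means)
back
```

The residual checks for normality and homoscedasticity then apply to the logit-scale model, for example via `plot(fit)`.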
