Solved – How to compare two groups with multiple measurements for each individual with R

error-propagation, r, statistical-significance, t-test

I have a problem like the following:

1) There are six measurements for each individual with large within-subject variance

2) There are two groups (Treatment and Control)

3) Each group consists of 5 individuals

4) I want to perform a significance test comparing the two groups to know if the group means are different from one another.

The data looks like this:
http://s10.postimg.org/p9krg6f3t/examp.png

I have run some simulations using the code below, which performs t-tests comparing the group means. The group means were calculated by taking the means of the individual means, which ignores the within-subject variability:

    n.simulations<-10000
    pvals=matrix(nrow=n.simulations,ncol=1)
    for(k in 1:n.simulations){
      subject=NULL
      for(i in 1:10){
        subject<-rbind(subject,as.matrix(rep(i,6)))
      }
      #set.seed(42)

      #Sample Subject Means
      subject.means<-rnorm(10,100,2)

      #Sample Individual Measurements
      values=NULL
      for(sm in subject.means){
        values<-rbind(values,as.matrix(rnorm(6,sm,20)))
      }

      out<-cbind(subject,values)

      #Split into GroupA and GroupB
      GroupA<-out[1:30,]
      GroupB<-out[31:60,]

      #Add effect size to GroupA (zero here, so the null hypothesis is true)
      GroupA[,2]<-GroupA[,2]+0

      colnames(GroupA)<-c("Subject", "Value")
      colnames(GroupB)<-c("Subject", "Value")

      #Calculate Individual Means and SDS
      GroupA.summary=matrix(nrow=length(unique(GroupA[,1])), ncol=2)
      for(i in 1:length(unique(GroupA[,1]))){
        GroupA.summary[i,1]<-mean(GroupA[which(GroupA[,1]==unique(GroupA[,1])[i]),2])
        GroupA.summary[i,2]<-sd(GroupA[which(GroupA[,1]==unique(GroupA[,1])[i]),2])
      }
      colnames(GroupA.summary)<-c("Mean","SD")


      GroupB.summary=matrix(nrow=length(unique(GroupB[,1])), ncol=2)
      for(i in 1:length(unique(GroupB[,1]))){
        GroupB.summary[i,1]<-mean(GroupB[which(GroupB[,1]==unique(GroupB[,1])[i]),2])
        GroupB.summary[i,2]<-sd(GroupB[which(GroupB[,1]==unique(GroupB[,1])[i]),2])
      }
      colnames(GroupB.summary)<-c("Mean","SD")

      Summary<-rbind(cbind(1,GroupA.summary),cbind(2,GroupB.summary))
      colnames(Summary)[1]<-"Group"

      pvals[k]<-t.test(GroupA.summary[,1],GroupB.summary[,1], var.equal=T)$p.value
    }
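
Because the added effect size is zero, the null hypothesis is true in every simulation, so the fraction of significant p-values estimates the type I error rate of this means-of-means approach (the same quantity reported as "% Sig p-values" in the histogram below):

    #Proportion of null simulations significant at alpha = 0.05;
    #a calibrated test should give roughly 0.05
    mean(pvals < 0.05)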

And here is code for plots:

#Plots
par(mfrow=c(2,2))
boxplot(GroupA[,2]~GroupA[,1], col="Red", main="Group A", 
        ylim=c(.9*min(out[,2]),1.1*max(out[,2])),
        xlab="Subject", ylab="Value")
stripchart(GroupA[,2]~GroupA[,1], vert=T, pch=16, add=T)
#abline(h=mean(GroupA[,2]), lty=2, lwd=3)

for(i in 1:length(unique(GroupA[,1]))){
  m<-mean(GroupA[which(GroupA[,1]==unique(GroupA[,1])[i]),2])
  ci<-t.test(GroupA[which(GroupA[,1]==unique(GroupA[,1])[i]),2])$conf.int[1:2]

  points(i-.2,m, pch=15,cex=1.5, col="Grey")
  segments(i-.2,
           ci[1],i-.2,
           ci[2], lwd=4, col="Grey"
  )
}
legend("topleft", legend=c("Individual Means +/- 95% CI"), bty="n", pch=15, lwd=3, col="Grey")


boxplot(GroupB[,2]~GroupB[,1], col="Light Blue", main="Group B", 
        ylim=c(.9*min(out[,2]),1.1*max(out[,2])),
        xlab="Subject", ylab="Value")
stripchart(GroupB[,2]~GroupB[,1], vert=T, pch=16, add=T)
#abline(h=mean(GroupB[,2]), lty=2, lwd=3)

for(i in 1:length(unique(GroupB[,1]))){
  m<-mean(GroupB[which(GroupB[,1]==unique(GroupB[,1])[i]),2])
  ci<-t.test(GroupB[which(GroupB[,1]==unique(GroupB[,1])[i]),2])$conf.int[1:2]

  points(i-.2,m, pch=15,cex=1.5, col="Grey")
  segments(i-.2,
           ci[1],i-.2,
           ci[2], lwd=4, col="Grey"
  )
}
legend("topleft", legend=c("Individual Means +/- 95% CI"), bty="n", pch=15, lwd=3, col="Grey")


boxplot(Summary[,2]~Summary[,1], col=c("Red","Light Blue"), xlab="Group", ylab="Average Value",
        ylim=c(.9*min(Summary[,2]),1.1*max(Summary[,2])),
        main="Individual Averages")
stripchart(Summary[,2]~Summary[,1], vert=T, pch=16, add=T)

points(.9, mean(GroupA.summary[,1]), pch=15,cex=1.5, col="Grey")
segments(.9,
         t.test(GroupA.summary[,1])$conf.int[1],.9,
         t.test(GroupA.summary[,1])$conf.int[2], lwd=4, col="Grey"
)

points(1.9, mean(GroupB.summary[,1]), pch=15,cex=1.5, col="Grey")
segments(1.9,
         t.test(GroupB.summary[,1])$conf.int[1],1.9,
         t.test(GroupB.summary[,1])$conf.int[2], lwd=4, col="Grey"
)
legend("topleft", legend=c("Group Means +/- 95% CI"), bty="n", pch=15, lwd=3, col="Grey")


hist(pvals, breaks=seq(0,1,by=.05), col="Grey",
     main=c(paste("# sims=", n.simulations),
            paste("% Sig p-values=",100*length(which(pvals<0.05))/length(pvals)))
)

Now, it seems to me that because each individual mean is itself an estimate, we should be less certain about the group means than the 95% confidence intervals in the bottom-left panel of the figure above suggest. Thus the calculated p-values underestimate the true variability, which should lead to more false positives if we wish to extrapolate to future data.

So what is the correct way to analyze this data?

Bonus:

The example above is a simplification. For the actual data:

1) The within-subject variance is positively correlated with the mean (points 1–3 are illustrated in a sketch after this list).

2) Values can only be multiples of two.

3) The individual results are not roughly normally distributed. They suffer from a zero floor effect and have long tails at the positive end.

4) The number of subjects in each group is not necessarily equal.
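
Purely as an illustration of points 1)–3) (this is not my actual data-generating process), doubled Poisson counts have a variance that grows with the mean, take only even values, are floored at zero, and are right-skewed:

#Hypothetical sketch: doubled Poisson draws mimic points 1-3
set.seed(1)
subject.means<-rnorm(10,10,2)
values<-sapply(subject.means,function(m){
  2*rpois(6,lambda=pmax(m,0)/2)  #even-valued, zero floor, long right tail
})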

Previous literature has used the t-test, ignoring within-subject variability and the other nuances, as was done in the simulations above. Are these results reliable? If I can extract means and standard errors from the figures, how would I calculate the "correct" p-values?
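
For the last question, here is a minimal sketch of a Welch t-test computed from group-level summary statistics alone; the means, standard errors, and subject counts (m1, se1, n1, etc.) are hypothetical placeholders for values read off a figure:

#Welch t-test from summary statistics (hypothetical values)
m1<-10; se1<-1.2; n1<-5   #group 1: mean, SEM, number of subjects
m2<-14; se2<-1.5; n2<-6   #group 2

t.stat<-(m1-m2)/sqrt(se1^2+se2^2)
#Welch-Satterthwaite degrees of freedom, written in terms of the SEMs
df<-(se1^2+se2^2)^2/(se1^4/(n1-1)+se2^4/(n2-1))
p<-2*pt(-abs(t.stat),df)

Note that this still treats each subject mean as a single observation, so it inherits the same caveat as the simulations above.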

EDIT:

Ok, here is what the actual data looks like. There are also three groups rather than two:


dput() of data:

structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 
3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 
6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 10, 
10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 
12, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 
15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 18, 
18, 18, 18, 18, 18, 2, 0, 16, 2, 16, 2, 8, 10, 8, 6, 4, 4, 8, 
22, 12, 24, 16, 8, 24, 22, 6, 10, 10, 14, 8, 18, 8, 14, 8, 20, 
6, 16, 6, 6, 16, 4, 2, 14, 12, 10, 4, 10, 10, 8, 4, 10, 16, 16, 
2, 8, 4, 0, 0, 2, 16, 10, 16, 12, 14, 12, 8, 10, 12, 8, 14, 8, 
12, 20, 8, 14, 2, 4, 8, 16, 10, 14, 8, 14, 12, 8, 14, 4, 8, 8, 
10, 4, 8, 20, 8, 12, 12, 22, 14, 12, 26, 32, 22, 10, 16, 26, 
20, 12, 16, 20, 18, 8, 10, 26), .Dim = c(108L, 3L), .Dimnames = list(
    NULL, c("Group", "Subject", "Value")))

EDIT 2:

In response to Henrik's answer:
So if I instead perform an ANOVA followed by the TukeyHSD procedure on the individual averages, as shown below, could I interpret this as underestimating my p-value by about 3-4x?

My goal with this part of the question is to understand how I, as a reader of a journal article, can better interpret previous results given their choice of analysis method. For example, they have those "stars of authority" showing me 0.01 > p > 0.001. So if I accept 0.05 as a reasonable cutoff, should I accept their interpretation? The only additional information is the mean and SEM.

#Get Individual Means
summary=NULL
for(i in unique(dat[,2])){
  sub<-which(dat[,2]==i)
  summary<-rbind(summary,cbind(
    dat[sub,1][1],        #Group (constant within subject)
    dat[sub,2][1],        #Subject ID
    mean(dat[sub,3]),
    sd(dat[sub,3])
  ))
}
colnames(summary)<-c("Group","Subject","Mean","SD")

#Note: (1|summary[,2]) is lme4 random-effect syntax; aov() does not
#interpret it that way and the term is effectively ignored
TukeyHSD(aov(summary[,3]~as.factor(summary[,1])+ (1|summary[,2])))

#      Tukey multiple comparisons of means
#        95% family-wise confidence level
#    
#    Fit: aov(formula = summary[, 3] ~ as.factor(summary[, 1]) + (1 | summary[, 2]))
#    
#    $`as.factor(summary[, 1])`
#             diff       lwr       upr     p adj
#    2-1 -0.672619 -4.943205  3.597967 0.9124024
#    3-1  7.507937  1.813822 13.202051 0.0098935
#    3-2  8.180556  2.594226 13.766885 0.0046312

EDIT 3:
I think we are getting close to an understanding. Here is the simulation described in the comments to @Stephane:

#Get Subject Means
means<-aggregate(Value~Group+Subject, data=dat, FUN=mean)

#Initialize "dat2" dataframe
dat2<-dat

#Initialize within-subject sd
s<-.001
pvals=matrix(nrow=10000,ncol=2)

for(j in 1:10000){
  #Sample individual measurements for each subject
  temp=NULL
  for(i in 1:nrow(means)){
    temp<-c(temp,rnorm(6,means[i,3],s))
  }

  #Set new values
  dat2[,3]<-temp

  #Take means of sampled values and fit the model
  dd2 <- aggregate(Value~Group+Subject, data=dat2, FUN=mean)
  fit2 <- lm(Value~Group, data=dd2)

  #Save sd and the p-value for the Group term
  pvals[j,]<-c(s,anova(fit2)[["Pr(>F)"]][1])

  #Update sd
  s<-s+.001
}

plot(pvals[,1],pvals[,2], xlab="Within-Subject SD", ylab="P-value")

[figure: plot of p-value against within-subject SD]

Best Answer

I take the liberty of answering the question in the title, namely how I would analyze this data.

Given that we have replicates within the samples, mixed models immediately come to mind: they should estimate the variability within each individual and control for it.

Hence I fit the model using lmer from lme4. However, as we are interested in p-values, I use mixed from afex, which obtains those via pbkrtest (i.e., the Kenward-Roger approximation for degrees of freedom). (afex also sets the contrasts to contr.sum, which I would use in such a case anyway.)
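
(For reference, a minimal sketch of roughly what mixed does under the hood, written directly with lme4 and pbkrtest and using the dat data frame as prepared below; this is a reconstruction, not the exact afex internals.)

require(lme4)
require(pbkrtest)

# Full model and a null model that drops the Group effect
full <- lmer(Value ~ Group + (1|Subject), data = dat)
null <- lmer(Value ~ 1 + (1|Subject), data = dat)

# Kenward-Roger approximate F test for the Group effect
KRmodcomp(full, null)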

To deal with the zero floor effect (i.e., the positive skew), I fit two alternative versions, transforming the dependent variable either with sqrt (for mild skew) or with log1p, i.e. log(1 + x), for stronger skew (plain log would map the zeros in the data to -Inf).

require(afex)

# read the dput() in as dat <- ...    
dat <- as.data.frame(dat)
dat$Group <- factor(dat$Group)
dat$Subject <- factor(dat$Subject)

(model <- mixed(Value ~ Group + (1|Subject), dat))
##        Effect    stat ndf ddf F.scaling p.value
## 1 (Intercept) 237.730   1  15         1  0.0000
## 2       Group   7.749   2  15         1  0.0049

(model.s <- mixed(sqrt(Value) ~ Group + (1|Subject), dat))
##        Effect    stat ndf ddf F.scaling p.value
## 1 (Intercept) 418.293   1  15         1  0.0000
## 2       Group   4.121   2  15         1  0.0375

(model.l <- mixed(log1p(Value) ~ Group + (1|Subject), dat))
##        Effect    stat ndf ddf F.scaling p.value
## 1 (Intercept) 458.650   1  15         1  0.0000
## 2       Group   2.721   2  15         1  0.0981

The effect is significant for the untransformed and the sqrt-transformed DV. But are these models sensible? Let's plot the residuals.

png("qq.png", 800, 300, units = "px", pointsize = 12)
par(mfrow = c(1, 3))
par(cex = 1.1)
par(mar = c(2, 2, 2, 1)+0.1)
qqnorm(resid(model[[2]]), main = "original")
qqline(resid(model[[2]]))
qqnorm(resid(model.s[[2]]), main = "sqrt")
qqline(resid(model.s[[2]]))
qqnorm(resid(model.l[[2]]), main = "log")
qqline(resid(model.l[[2]]))
dev.off()

[figure: normal QQ plots of the residuals for the original, sqrt, and log models]

It seems that the model with the sqrt transformation provides a reasonable fit (there still seems to be one outlier, but I will ignore it). So, let's further inspect this model using multcomp to get the comparisons among groups:

require(multcomp)

# using the Bonferroni-Holm correction for multiple comparisons
summary(glht(model.s[[2]], linfct = mcp(Group = "Tukey")), test = adjusted("holm"))
##          Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lmer(formula = sqrt(Value) ~ Group + (1 | Subject), data = data)
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)  
## 2 - 1 == 0  -0.0754     0.3314   -0.23    0.820  
## 3 - 1 == 0   1.1189     0.4419    2.53    0.023 *
## 3 - 2 == 0   1.1943     0.4335    2.75    0.018 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## (Adjusted p values reported -- holm method)

# using the default multiple comparison correction (which I don't understand)
summary(glht(model.s[[2]], linfct = mcp(Group = "Tukey")))
##          Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lmer(formula = sqrt(Value) ~ Group + (1 | Subject), data = data)
## 
## Linear Hypotheses:
##            Estimate Std. Error z value Pr(>|z|)  
## 2 - 1 == 0  -0.0754     0.3314   -0.23    0.972  
## 3 - 1 == 0   1.1189     0.4419    2.53    0.030 *
## 3 - 2 == 0   1.1943     0.4335    2.75    0.016 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## (Adjusted p values reported -- single-step method)

Punchline: group 3 differs from the other two groups, which do not differ from each other.
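
(As an aside, the same pairwise comparisons can also be obtained with the emmeans package; a minimal sketch, not part of the original analysis:)

require(emmeans)

# Estimated marginal means on the sqrt scale plus pairwise contrasts;
# the contrasts are multiplicity-adjusted (Tukey method) by default
emmeans(model.s[[2]], pairwise ~ Group)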
