Solved – Relationship between paired t test and simple mixed model

Tags: mixed-model, r, t-test


I have a psychological data set which, traditionally, would be analysed using a paired samples t test.
The design of the experiment is $39 (subjects) \times 7 (targets) \times 2 (conditions)$, and I'm interested in the difference in a given variable between the conditions.

The traditional approach has been to average across targets so that I have 2 observations per participant, and then compare these averages using a paired t test.

I wanted to use a mixed-models approach, as has become increasingly popular in this field (e.g. Baayen, Davidson & Bates, 2008), and so the first model I fit, which I thought should approximate the results of the t test, was one with $condition$ as a fixed effect and a random intercept for $subject$ (i.e. $var = \alpha + \beta \cdot condition + u_{subject} + \epsilon$, where $u_{subject}$ is the subject's random intercept). Obviously, the full model would also include random intercepts for $targets$.
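For reference, here is a sketch of that fuller specification in lme4 syntax; it assumes the same data frame and column names used in my code below (data, var, condition, subject), plus a target column that doesn't appear there:

# Sketch of the full model: crossed random intercepts for subjects and targets.
# 'target' is assumed to be a column of 'data'; it doesn't appear in the code below.
>library(lme4)
>lm.full = lmer(var ~ condition + (1|subject) + (1|target), data=data)
>summary(lm.full)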

However, I'm struggling to understand why I achieve pretty divergent results between the two approaches.
Can anyone explain what's going on here?
I've also seen (what I understand to be) a similar question asked here, with an answer about correlation structure that I'm not equipped to understand. If this is also what's at issue here, I would appreciate it if anyone could suggest some resources to read up on this.

Edit: I've posted the example data and R script here.

Edit #2 – Bounty added

Some additional points:

  • I'm only analysing the correct responses (think of it as analogous to reaction time), so there are missing cases – not every participant provides 7 data points per condition.
    • When I analyse all responses, rather than just the correct ones, the difference between the two results is reduced but not eliminated. This suggests to me that the missing cases are a factor here (see the sketch just after this list for a quick way to check the imbalance).
  • The variable isn't normally distributed. In my final model I apply a Box-Cox transformation to it, but I omit that here for consistency with the t test.
  • As pointed out by @PeterFlom, the $df$s differ hugely between the two approaches, but I assume this to be because the t test is being applied to the aggregate data (2 observations per participant, 1 per condition), while the mixed model is applied to raw scores ($<14$ observations per participant, $<7$ per condition).
  • @BenBolker notes that the t values also differ pretty considerably.
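
A quick way to see how unbalanced the correct-response data actually are is to tabulate the number of observations each participant contributes per condition. This is only a sketch, assuming the same data, subject, and condition columns used in the analysis code below:

# Sketch: count correct responses per participant and condition.
>table(data$subject, data$condition)
# Participants with zero observations in a condition are the ones the
# aggregated paired t test can drop or mis-pair.
>which(rowSums(table(data$subject, data$condition) == 0) > 0)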

My analysis code is below.

>library(dplyr)
>subject_means = group_by(data, subject, condition) %>% summarise(var=mean(var))
>t.test(var ~ condition, data=subject_means, paired=T)

    Paired t-test

data:  var by condition
t = -1.3394, df = 37, p-value = 0.1886
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.14596388  0.02978745
sample estimates:
mean of the differences 
            -0.05808822 

>library(lme4)
>lm.0 = lmer(var ~ (1|subject), data=data)
>lm.1 = lmer(var ~ condition + (1|subject), data=data)
>anova(lm.0, lm.1)

Data: data
Models:
object: var ~ (1 | subject)
..1: var ~ condition + (1 | subject)
       Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
object  3 489.09 501.23 -241.55   483.09                           
..1     4 485.81 502.00 -238.90   477.81 5.2859      1     0.0215 *

>library(lmerTest)
>summary(lm.1)$coef

              Estimate Std. Error        df  t value     Pr(>|t|)
(Intercept) 0.11862462 0.02878027  98.60659 4.121734 7.842075e-05
condition   0.09580546 0.04161237 400.27441 2.302331 2.182890e-02

Notice, specifically, the drop in the p value from $p = .188$ in the t test to $p = .021$ from either lmer method.
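
If I understand the equivalence correctly, the paired t test should correspond not to the model on the raw trials, but to the same random-intercept model fitted to the aggregated data. A sketch, reusing subject_means from above and assuming every subject contributes a mean for both conditions:

# Sketch: the same random-intercept model, but on the subject-by-condition means.
# With exactly one observation per subject per condition, this should reproduce
# the paired t test (up to lmerTest's df approximation).
>lm.agg = lmer(var ~ condition + (1|subject), data=subject_means)
>summary(lm.agg)$coef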


I've tried, and failed, to provide a reproducible example of this using the anorexia dataset in the MASS package, so I assume the problem is something idiosyncratic to my data, but I don't understand what.

# Borrowing from http://ww2.coastal.edu/kingw/statistics/R-tutorials/dependent-t.html
>data(anorexia, package="MASS")
>ft = subset(anorexia, subset=(Treat=="FT"))
>wgt = c(ft$Prewt, ft$Postwt)
>pre.post = rep(c("pre","post"), c(17,17))
>subject = rep(LETTERS[1:17], 2)
>ft.new = data.frame(wgt, pre.post, subject)  # combine into the data frame used below
>t.test(wgt ~ pre.post, data=ft.new, paired=T)

    Paired t-test

data:  wgt by pre.post
t = 4.1849, df = 16, p-value = 0.0007003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.58470 10.94471
sample estimates:
mean of the differences 
               7.264706 

>m = lmer(wgt ~ pre.post + (1|subject), data=ft.new)
>summary(m)$coef

             Estimate Std. Error       df   t value     Pr(>|t|)
(Intercept) 90.494118   1.689013 26.17129 53.578096 0.0000000000
pre.postpre -7.264706   1.735930 15.99968 -4.184908 0.0007002806

Best Answer

I think the problem is the way the paired t test is computed. Try this:

t.test(all_by_sub$var[all_by_sub$condition==1], all_by_sub$var[all_by_sub$condition==0], paired=TRUE)

This gives:

data:  all_by_sub$var[all_by_sub$condition == 1] and all_by_sub$var[all_by_sub$condition == 0]
t = 2.0529, df = 37, p-value = 0.0472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.0009400428 0.1435297005
sample estimates:
mean of the differences 
             0.07223487 

Obviously, t.test computes x minus y in the paired t test, which is why the signs of the estimate and the t value are reversed. Beyond that, both the estimate (0.072 vs. 0.082) and the t value (2.05 vs. 2.19) from the mixed model are very close to the t-test results:

            Estimate Std. Error      df t value Pr(>|t|)
(Intercept)    0.118      0.028 106.772   4.159    0.000
condition      0.082      0.037 462.992   2.192    0.029
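
To make the pairing and the direction of subtraction explicit, you can also compute the per-subject differences yourself and run a one-sample t test on them. A sketch, assuming all_by_sub has exactly one row per subject per condition and is ordered consistently within each condition (the same assumption the paired call above relies on):

# Sketch: one-sample t test on explicit per-subject differences; algebraically
# the same as the paired t test, with the sign of the comparison made explicit.
>d = all_by_sub$var[all_by_sub$condition==1] - all_by_sub$var[all_by_sub$condition==0]
>t.test(d)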