Solved – How to use a linear model with two factors and repeated measures

linear modellmrrepeated measures

Suppose I have a date set of the form:

Subject / Sex / No. of mistakes - morning / No. of mistakes - afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5

Suppose that I want to model number of mistakes as a linear function of the categorical variables time and sex, how can I do this using the lm() function in R? The only thing I can think of is to edit the data frame with so that the new columns are: No. of mistakes / Morning or afternoon / Sex / Subject. Then we will have 8 row entries of data in this new data frame. However, here we are not grouping the same morning / afternoon exercise together if they came from the same subject, so as the test subject is not part of the linear model we are losing some information (introducing increased variance, etc.).

Could someone please explain:

  1. How to use R here with the lm function
  2. If we do indeed choose not to account for the factor that the afternoon / morning mistakes are actually grouping in pairs (coming from the same test subject) when creating the linear model, then how is this a fair way to model? My guess would be that as the data is from a completely randomised design which is balanced, this effect is randomised and so does not add any bias to the data. Could anyone verify / correct me on this, perhaps with a more formal argument?

Best Answer

You're right that not incorporating subject somehow will violate the assumptions of the model. In general, the point estimates will be unbiased, but your interval estimates (confidence intervals, p-values, etc.) will not have the right width. Typically, they would be too narrow, although they can be too wide sometimes. There are several ways to deal with this.

First, you could include subject as a factor in your model. That is, just as you estimate an effect of male vs. female, you estimate the effect of being subject A vs. being subject B, etc. This is a fixed effect. In your case, this really won't work because you don't have enough data on each subject and a model that tries to fit time, sex and subject will have used all its degrees of freedom and not be identifiable.

Another way to include subject is to use a random effect. What this means is that you consider the subjects in your study to be a random sample from a population of interest and instead of estimating an effect associated with each subject, you estimate the mean and variance (SD) of the population. With your four subjects, the fixed effect approach will consume 3 degrees of freedom, using random effects consumes (in some sense) 2 degrees of freedom. (Actually, that isn't quite true: the effect of the random effects on the model's degrees of freedom is a difficult topic, but that statement is good enough for our purposes.) Many models with repeated measures would be saturated or worse if the subjects were modeled as fixed effects and random effects are a common strategy for dealing with such situations.

A final strategy, since you have only two measures, is to subtract them and use the differences as your response. Then your test is a one-sample test of whether the differences are equal to zero. If they are significantly greater than 0, you know that afternoons are associated with more mistakes than mornings, and if the differences are <0, then vise versa. Taking differences uses each subject as their own control, so subject doesn't have to be controlled for with fixed or random effects. In your case, this approach will be a little more complicated because you also want to test for the effect of sex, but this version could still be used.

Some other notes: You will have great difficulty doing anything with those data (if they are actually your real data) because they are so few. In addition, ordinary linear models (i.e., lm) would not be appropriate for those data because they are counts. You should use a Poisson type model instead (i.e., glm, or in this case, glmer).

Here is a quick analysis in R, using random effects and a Poisson GLMM:

##### entering your data into a data frame:  
my.data = "Subject / Sex / morning / afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5"
my.data = gsub(my.data, pattern="\n", replacement=" / ")
my.data = unlist(strsplit(my.data, split=" / "))
my.data = matrix(my.data, ncol=4, byrow=T)
colnames(my.data) = my.data[1,]
my.data = as.data.frame(my.data[-1,])
my.data$morning   = as.numeric(as.character(my.data$morning))
my.data$afternoon = as.numeric(as.character(my.data$afternoon))
my.data
#   Subject Sex morning afternoon
# 1       A   M       2         5
# 2       B   F       1         4
# 3       C   M       3         5
# 4       D   F       1         5

##### creating a 'long form' dataset:
library(reshape2)
my.data.l = melt(my.data, id.vars=c("Subject", "Sex"), 
                 measure.vars=c("morning", "afternoon"),
                 variable.name="time", value.name="mistakes")
my.data.l
#   Subject Sex      time mistakes
# 1       A   M   morning        2
# 2       B   F   morning        1
# 3       C   M   morning        3
# 4       D   F   morning        1
# 5       A   M afternoon        5
# 6       B   F afternoon        4
# 7       C   M afternoon        5
# 8       D   F afternoon        5
detach(package:reshape2, unload=T)

##### this is the analysis of your data:
library(lme4)
model = glmer(mistakes~time+Sex+(1|Subject), my.data.l, family=poisson)
summary(model)
# Generalized linear mixed model fit by maximum likelihood ['glmerMod']
#  Family: poisson ( log )
# Formula: mistakes ~ time + Sex + (1 | Subject) 
#    Data: my.data.l 
# 
#      AIC      BIC   logLik deviance 
#  32.3889  32.7067 -12.1945  24.3889 
# 
# Random effects:
#   Groups  Name        Variance  Std.Dev. 
#   Subject (Intercept) 5.704e-13 7.552e-07
# Number of obs: 8, groups: Subject, 4
# 
# Fixed effects:
#               Estimate Std. Error z value Pr(>|z|)  
# (Intercept)     0.3926     0.4419   0.888   0.3744  
# timeafternoon   0.9985     0.4421   2.258   0.0239 *
# SexM            0.3102     0.3970   0.781   0.4346  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Correlation of Fixed Effects:
#             (Intr) tmftrn
# timeafternn -0.731       
# SexM        -0.518  0.000