Solved – How to use a linear model with two factors and repeated measures

linear modellmrrepeated measures

Suppose I have a date set of the form:

Subject / Sex / No. of mistakes - morning / No. of mistakes - afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5

Suppose that I want to model number of mistakes as a linear function of the categorical variables time and sex, how can I do this using the lm() function in R? The only thing I can think of is to edit the data frame with so that the new columns are: No. of mistakes / Morning or afternoon / Sex / Subject. Then we will have 8 row entries of data in this new data frame. However, here we are not grouping the same morning / afternoon exercise together if they came from the same subject, so as the test subject is not part of the linear model we are losing some information (introducing increased variance, etc.).

Could someone please explain:

How to use R here with the lm function
If we do indeed choose not to account for the factor that the afternoon / morning mistakes are actually grouping in pairs (coming from the same test subject) when creating the linear model, then how is this a fair way to model? My guess would be that as the data is from a completely randomised design which is balanced, this effect is randomised and so does not add any bias to the data. Could anyone verify / correct me on this, perhaps with a more formal argument?

Best Answer

You're right that not incorporating subject somehow will violate the assumptions of the model. In general, the point estimates will be unbiased, but your interval estimates (confidence intervals, p-values, etc.) will not have the right width. Typically, they would be too narrow, although they can be too wide sometimes. There are several ways to deal with this.

First, you could include subject as a factor in your model. That is, just as you estimate an effect of male vs. female, you estimate the effect of being subject A vs. being subject B, etc. This is a fixed effect. In your case, this really won't work because you don't have enough data on each subject and a model that tries to fit time, sex and subject will have used all its degrees of freedom and not be identifiable.

Another way to include subject is to use a random effect. What this means is that you consider the subjects in your study to be a random sample from a population of interest and instead of estimating an effect associated with each subject, you estimate the mean and variance (SD) of the population. With your four subjects, the fixed effect approach will consume 3 degrees of freedom, using random effects consumes (in some sense) 2 degrees of freedom. (Actually, that isn't quite true: the effect of the random effects on the model's degrees of freedom is a difficult topic, but that statement is good enough for our purposes.) Many models with repeated measures would be saturated or worse if the subjects were modeled as fixed effects and random effects are a common strategy for dealing with such situations.

A final strategy, since you have only two measures, is to subtract them and use the differences as your response. Then your test is a one-sample test of whether the differences are equal to zero. If they are significantly greater than 0, you know that afternoons are associated with more mistakes than mornings, and if the differences are <0, then vise versa. Taking differences uses each subject as their own control, so subject doesn't have to be controlled for with fixed or random effects. In your case, this approach will be a little more complicated because you also want to test for the effect of sex, but this version could still be used.

Some other notes: You will have great difficulty doing anything with those data (if they are actually your real data) because they are so few. In addition, ordinary linear models (i.e., lm) would not be appropriate for those data because they are counts. You should use a Poisson type model instead (i.e., glm, or in this case, glmer).

Here is a quick analysis in R, using random effects and a Poisson GLMM:

##### entering your data into a data frame:  
my.data = "Subject / Sex / morning / afternoon
A / M / 2 / 5
B / F / 1 / 4
C / M / 3 / 5
D / F / 1 / 5"
my.data = gsub(my.data, pattern="\n", replacement=" / ")
my.data = unlist(strsplit(my.data, split=" / "))
my.data = matrix(my.data, ncol=4, byrow=T)
colnames(my.data) = my.data[1,]
my.data = as.data.frame(my.data[-1,])
my.data$morning   = as.numeric(as.character(my.data$morning))
my.data$afternoon = as.numeric(as.character(my.data$afternoon))
my.data
#   Subject Sex morning afternoon
# 1       A   M       2         5
# 2       B   F       1         4
# 3       C   M       3         5
# 4       D   F       1         5

##### creating a 'long form' dataset:
library(reshape2)
my.data.l = melt(my.data, id.vars=c("Subject", "Sex"), 
                 measure.vars=c("morning", "afternoon"),
                 variable.name="time", value.name="mistakes")
my.data.l
#   Subject Sex      time mistakes
# 1       A   M   morning        2
# 2       B   F   morning        1
# 3       C   M   morning        3
# 4       D   F   morning        1
# 5       A   M afternoon        5
# 6       B   F afternoon        4
# 7       C   M afternoon        5
# 8       D   F afternoon        5
detach(package:reshape2, unload=T)

##### this is the analysis of your data:
library(lme4)
model = glmer(mistakes~time+Sex+(1|Subject), my.data.l, family=poisson)
summary(model)
# Generalized linear mixed model fit by maximum likelihood ['glmerMod']
#  Family: poisson ( log )
# Formula: mistakes ~ time + Sex + (1 | Subject) 
#    Data: my.data.l 
# 
#      AIC      BIC   logLik deviance 
#  32.3889  32.7067 -12.1945  24.3889 
# 
# Random effects:
#   Groups  Name        Variance  Std.Dev. 
#   Subject (Intercept) 5.704e-13 7.552e-07
# Number of obs: 8, groups: Subject, 4
# 
# Fixed effects:
#               Estimate Std. Error z value Pr(>|z|)  
# (Intercept)     0.3926     0.4419   0.888   0.3744  
# timeafternoon   0.9985     0.4421   2.258   0.0239 *
# SexM            0.3102     0.3970   0.781   0.4346  
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Correlation of Fixed Effects:
#             (Intr) tmftrn
# timeafternn -0.731       
# SexM        -0.518  0.000

Related Solutions

Solved – Repeated-measures, crossover analysis using linear mixed model in R

Every observation is an individual sprint if I understand you correctly. Thus, there are as many observations as there are sprints. This means that you cannot include the individual sprint as an fixed effect as this would give a perfect fit.

It cannot be included as a random effect either. This is because the factor describing the individual sprint is identical to the factor consisting of observation numbers. The latter factor is what is used for the error term of your model. Thus, it will not be identifiable what portion comes from each of the two variance componenents as they are defined by the same factor.

If you try fitting a model with this issue using ${\tt lmer}$ of the ${\tt lme4}$ you'll get an error telling you that grouping factors of random effects should have fewer levels than there are observations in data. Using ${\tt lme}$ of ${\tt nlme}$ you can fit the model but it'll just split the variation between the two variance components in some arbitrary way (to the best of my knowledge).

Actually, you could say that the above model already takes the individual sprint into account. One is just not able to seperate the effect from the measurement error.

Solved – How to build a Generalized Linear Mixed Model with repeated measures in R

'Score' is my response variable, and consists of counts, so I'm thinking this represents a Poisson distribution.

That's a reasonable first guess, although you should definitely check for overdispersion (after running the model)

However, I don't think this accounts for any nesting of the data, and again, I'm not sure if that is important here. Also, I'm not sure if 'Day' also needs to be incorporated as a random effect.

I think you're fine. The fact that individuals are nested within species gets take care of automatically. You might want (1) to include variation among individuals in their learning rate, i.e. use (Day|Subject); (2) allow for temporal trends other than (log-)linear, using polynomials (poly()) or splines (splines::ns()) or generalized additive mixed models (gamm4 package).

I also have sex information about all of my subjects ... I assume sex could also be considered as a random effect?

You should probably include sex as a fixed effect, e.g.

Score~Species*Sex*Day + (Day|Subject)

Best Answer

Related Solutions

Solved – Repeated-measures, crossover analysis using linear mixed model in R

Solved – How to build a Generalized Linear Mixed Model with repeated measures in R

Related Question