GLMM – Effect of Random Effects with Only One Observation in Generalized Linear Mixed Model

I have a data set in which the variable I'd like to use as a random effect only has a single observation for some levels. Based on the answers to previous questions, I've gathered that, in principle, this can be fine.

Can I fit a mixed model with subjects that only have 1 observation?

Random intercepts model – one measurement per subject

However, in the second link, the first answer states:

"…assuming you are not using a generalized linear mixed model GLMM where in that case issues of over-dispersion come into play"

I am considering using a GLMM, but I don't really understand how random effect levels with single observations will affect the model.

Here is an example of one of the models I'm trying to fit. I'm studying birds, and I'd like to model the effects of population and season on the number of stops during migration. I'd like to use individual as a random effect, because for some individuals I have up to 5 years of data.

library(dplyr)
library(lme4)
pop <- as.character(c("BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "BF", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "MA", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "NU", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA", "SA"))
id <- "2 2 4 4 7 7 9 9 10 10 84367 84367 84367 84368 84368 84368 84368 84368 84368 84369 84369 33073 33073 33073 33073 33073 33073 33073 33073 33073 80149 80149 80149 80150 80150 80150 57140 57141 126674 126677 126678 126680 137152 137152 137157 115925 115925 115925 115925 115925 115925 115925 115925 115926 115926 115926 115926 115926 115926 115927 115928 115929 115929 115929 115930 115930 115930 115930 115931 115931 115931 115932 115932 115932"
id <- strsplit(id, " ")
id <- as.numeric(unlist(id))
year <- "2014 2015 2014 2015 2014 2015 2014 2015 2014 2015 2009 2010 2010 2009 2010 2010 2011 2011 2012 2009 2010 2009 2009 2010 2010 2011 2011 2012 2012 2013 2008 2008 2009 2008 2008 2009 2008 2008 2013 2013 2013 2013 2014 2015 2014 2012 2013 2013 2014 2014 2015 2015 2016 2012 2013 2013 2014 2014 2015 2013 2012 2012 2013 2013 2012 2013 2013 2014 2013 2014 2014 2013 2014 2014"
year <- strsplit(year, " ")
year <- as.numeric(unlist(year))
season <- as.character(c("fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "fall", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "fall", "spring", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "spring", "fall", "spring", "spring", "fall", "fall", "fall", "fall", "spring", "fall", "fall", "spring", "spring","fall", "fall", "spring", "fall", "fall", "spring"))
stops <- "0 0 0 0 0 0 1 0 2 1 1 0 0 3 2 0 1 1 0 1 1 2 0 1 0 2 0 4 0 0 2 1 1 2 5 2 1 0 9 6 2 3 4 7 2 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0"
stops <- strsplit(stops, " ")
stops <- as.numeric(unlist(stops))

stopdata <- data.frame(pop = pop, id = id, year = year, season = season, stops = stops, stringsAsFactors = FALSE)


stopdata <- group_by(stopdata, pop, id)
summary1 <- summarise(stopdata, n.years = length(year))
table(summary1$n.years)

There are 27 individuals. 9 individuals have a single observations. 18 individuals have 2-9 observations.

What should be concerned about if 1/3 of the random effect levels only have one observation?

I've been considering:

Option 1: GLMM as described above

stops.glmm <- glmer(stops ~ pop + season + (1|id), data=stopdata, family = poisson)

Option 2: Weighted generalized linear model GLM using means for the individuals with multiple observations

aggfun <- function(data, idvars=c("pop", "season", "id"), response){
#select id variables, response variable, and year
sub1 <- na.omit(data[,c(idvars, "year", response)])
#aggregate for mean response by year
agg1 <- aggregate(sub1[names(sub1) == response],by=sub1[idvars],FUN=mean)
#sample size for each aggregated group
aggn <- aggregate(sub1[response],by=sub1[idvars],FUN=length)
#rename sample size column
names(aggn)[4] <- "n"
agg2 <- merge(agg1, aggn)
agg2}


#Create weighted dataset
stops.weight <- aggfun(data = stopdata, response = "stops")
stops.weight$stops <- round(stops.weight$stops)

#Weighted GLM
stops.glm <- glm(stops~pop + season, data=stops.weight, family = poisson, weights = n)

Best Answer

In general, you have an issue with identifiability. Linear models with a random effect assigned to a parameter with only one measurement can't distinguish between the random effect and the residual error.

A typical linear mixed effect equation will look like:

$E = \beta + \eta_i + \epsilon_j$

Where $\beta$ is the fixed effect, $\eta_i$ is the random effect for level $i$, and $\epsilon_j$ is the residual variability for the $j$th measurement. When you have only one observation of a level with a random effect, it is difficult to distinguish between $\eta$ and $\epsilon$. You will (typically) be fitting a variance or standard deviation to $\eta$ and $\epsilon$, so with only one measure per individual, you will be not be as certain that you have an accurate estimate for $SD(\eta)$ and $SD(\epsilon)$, but the estimate of the sum of the variances ($var(\eta) + var(\epsilon)$) should be relatively robust.

On to the practical answer: If you have about 1/3 of your observations with a single observation per individual, you are probably OK overall. The rest of the population should provide a reasonable estimate for $SD(\eta)$ and $SD(\epsilon)$, and these individuals should be minor contributors overall. On the other hand, if you have all individuals at a specific fixed effect and random effect with a single measure (e.g. for your example, perhaps all of a population-- perhaps that means species for you), then you would trust the result less.

Best Answer

Related Solutions

Related Question