Suppose you are measuring temperature $T_{ij}$ for $i =1, \dots ,4$ subjects and $j= 1, \dots ,4$ time points. For subject 1, suppose $T_{12}$ and $T_{14}$ were missing. Would you omit the entire record for subject 1 if you are running a linear mixed model?

# Solved – Missing observations in a linear mixed model

missing data, mixed model

#### Related Solutions

Here is a background paper, "Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis", with full text available at http://www.sciencedirect.com/science/article/pii/S0047259X09001079.

To quote some pertinent comments from this reference:

"When the population follows a confirmatory factor model, and data are missing due to the magnitude of the factors, the MLE may not be consistent even when data are normally distributed. When data are missing due to the magnitude of measurement errors/uniqueness, MLEs for many of the covariance parameters related to the missing variables are still consistent."

The aforementioned paper identifies and discusses factors that impact the asymptotic biases of the MLE for data that are not missing at random.

A technique that I believe has value here is to impute the missing values; see "What to Do about Missing Values in Time-Series Cross-Section Data", full text available at http://gking.harvard.edu/files/pr.pdf

To quote the author, the paper suggests the "concept of “multiple imputation,” a well-accepted and increasingly common approach to missing data problems in many fields. The idea is to extract relevant information from the observed portions of a data set via a statistical model, to impute multiple (around five) values for each missing cell, and to use these to construct multiple “completed” datasets."
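As a rough illustration of that workflow, here is a minimal numpy sketch with entirely made-up data and a deliberately crude imputation model (each missing cell is drawn from a normal centered at its column mean); real multiple imputation software such as Amelia or mice uses a proper joint model for the draws, but the analyse-then-pool structure with Rubn's rules is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: temperature for 4 subjects x 4 time points; T[0,1] and T[0,3] missing.
T = np.array([[20.1, np.nan, 21.3, np.nan],
              [19.8, 20.5,   21.0, 21.9],
              [20.3, 20.9,   21.6, 22.2],
              [19.5, 20.2,   20.8, 21.5]])

m = 5                                  # number of imputations (around five, per the paper)
miss = np.isnan(T)
col_mean = np.nanmean(T, axis=0)
col_sd = np.nanstd(T, axis=0)
miss_cols = np.where(miss)[1]          # column index of each missing cell

estimates, variances = [], []
for _ in range(m):
    Ti = T.copy()
    # Crude stochastic imputation: draw each missing cell from N(column mean, column sd).
    Ti[miss] = rng.normal(col_mean[miss_cols], col_sd[miss_cols])
    # Analyse each completed dataset; here the "analysis" is just the grand mean.
    n = Ti.size
    estimates.append(Ti.mean())
    variances.append(Ti.var(ddof=1) / n)

# Rubin's rules: pool the m point estimates and their variances.
q_bar = np.mean(estimates)             # pooled point estimate
w = np.mean(variances)                 # within-imputation variance
b = np.var(estimates, ddof=1)          # between-imputation variance
total_var = w + (1 + 1 / m) * b        # total variance of the pooled estimate
```

The between-imputation term `b` is what propagates the uncertainty due to missingness into the final standard error, which single imputation cannot do.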

Your original model:

$Y_{si} = \beta_0 + S_{0s} + (\beta_{1} + S_{1s})X_{1si} + \beta_{2}X_{2si} + \beta_{3}X_{3si} + \beta_{4}X_{1si}X_{2si} + \beta_{5}X_{2si}X_{3si} + \epsilon_{si}$, where $s = 1,\dots,S$ indexes the subject, $i = 1,\dots,I_s$ indexes the measurement, $X_{1si}$ is day of year, $X_{2si}$ is a factor, and $X_{3si}$ is temperature, with $\epsilon_{si} \sim N(0, \sigma^2)$ and $(S_{0s}, S_{1s})' \sim N\left((0,0)', \begin{pmatrix}\sigma_1^2 & \sigma_{12}\\ \sigma_{12} & \sigma_2^2\end{pmatrix}\right)$. $\beta_0,\dots,\beta_5$ are fixed effects.
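To make the specification concrete, here is a numpy sketch that simulates data from exactly this model, with a correlated random intercept and slope per subject; all the numeric values (sample sizes, coefficients, variance components) are illustrative assumptions, not estimates from any real data:

```python
import numpy as np

rng = np.random.default_rng(1)

S, I = 10, 20                                 # subjects, measurements per subject (made up)
beta = np.array([1.0, 0.02, 0.5, 0.3, 0.01, 0.02])   # beta_0 .. beta_5 (illustrative)
Sigma = np.array([[0.5, 0.05],                # cov of (S_0s, S_1s): sigma_1^2, sigma_12
                  [0.05, 0.02]])              #                     sigma_12, sigma_2^2
sigma = 0.4                                   # residual sd

# Random intercept and random slope, drawn jointly for each subject.
re = rng.multivariate_normal([0.0, 0.0], Sigma, size=S)

rows = []
for s in range(S):
    x1 = rng.integers(1, 366, size=I)         # day of year
    x2 = rng.integers(0, 2, size=I)           # two-level factor coded 0/1
    x3 = rng.normal(20, 3, size=I)            # temperature
    eps = rng.normal(0, sigma, size=I)
    y = (beta[0] + re[s, 0]
         + (beta[1] + re[s, 1]) * x1
         + beta[2] * x2 + beta[3] * x3
         + beta[4] * x1 * x2 + beta[5] * x2 * x3
         + eps)
    rows.append(y)

Y = np.vstack(rows)                           # S x I matrix of responses
```

Simulating from a candidate model like this, then fitting it back, is a cheap way to check that the specification is identifiable before touching the real data.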

Is $X_{1si}$ equal to 1 for Jan 1 and to 365 (or 366) for Dec 31? If so, a periodic function may be needed, or the variable may need to be dropped, because the model implies that the difference between the means of $Y_{si}$ on Jan 1, 2016 and Dec 31, 2015 is $365\beta_1$, which may not be realistic.

I think your random slope should be on $X_{3si}$ instead of $X_{1si}$. Maybe you can fit a model like this: $Y_{si} = \beta_0 + S_{0s} + \beta_{1}X_{1si} + \beta_{2}X_{2si} + (\beta_{3}+S_{3s})X_{3si} + \beta_{4}X_{1si}X_{2si} + \beta_{5}X_{2si}X_{3si} + \epsilon_{si}$

Obviously, this is an exploratory analysis: you need to find the model that fits the data. My experience is to fit several fixed-effects models (linear models), first with temperature alone and then with other covariates, even the interactions. If you cannot find any model that behaves as you expect, maybe your theory is incorrect. If you find what you want, try adding the random effects to the model, so that the final model is more reasonable.

In matrix form, the mixed model is

$Y = X\beta + Z\gamma + \epsilon$, where $\gamma \sim N(0, G)$ and $\epsilon \sim N(0, R)$. For a given $X$, the variance-covariance matrix of $Y$ is

$Var(Y) = ZGZ'+R$
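A tiny numpy example makes this concrete: for one subject with a random intercept and a random slope on a covariate, $Z$ is a column of ones next to the covariate, and the marginal covariance is $ZGZ' + R$. The numbers below are illustrative assumptions:

```python
import numpy as np

# One subject with 4 measurements; random intercept + random slope on covariate x.
x = np.array([1.0, 2.0, 3.0, 4.0])
Z = np.column_stack([np.ones(4), x])     # random-effects design: (intercept, slope)

G = np.array([[0.5, 0.1],                # Var(intercept), Cov(intercept, slope)
              [0.1, 0.2]])               # Cov(intercept, slope), Var(slope)
R = 0.4**2 * np.eye(4)                   # iid residuals: sigma^2 * I

V = Z @ G @ Z.T + R                      # Var(Y) = Z G Z' + R
```

Printing `V` shows exactly which correlation structure the random effects induce; with more columns in $Z$ this matrix quickly becomes hard to reason about, which is the point being made above.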

Generally, we are not interested in the random effects; instead we want to estimate the fixed effects $\beta$. The purpose of including random effects in the model is to make the model better reflect the real situation when correlation exists among the responses. If $Z$ has many columns with a complicated structure, it is difficult to figure out what $ZGZ'$ looks like, which means you do not know what model you are fitting. Theoretically, you can have many continuous variables in $Z$, but in practice it is difficult to interpret a model with two or more continuous variables in $Z$.

Another method is to remove the random effects and specify the variance-covariance matrix directly through $R$. When the variance-covariance structure is clear, this method is better than random effects.

In your case, if you think that temperature affects the correlation (for example, two measurements from the same subject are more highly correlated when their temperatures are close), you can specify $R$ through the difference of the temperatures, such as $\rho^{|t_i-t_j|}$.
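As a sketch of that idea with made-up temperatures, the matrix with entries $\sigma^2\rho^{|t_i - t_j|}$ can be built directly:

```python
import numpy as np

temps = np.array([18.0, 18.5, 21.0, 25.0])   # temperatures at one subject's 4 measurements
rho, sigma = 0.8, 1.0                        # illustrative values

# R_ij = sigma^2 * rho^{|t_i - t_j|}: closer temperatures -> higher correlation.
D = np.abs(temps[:, None] - temps[None, :])  # pairwise |t_i - t_j|
R = sigma**2 * rho**D
```

This is the continuous analogue of an AR(1) structure, with temperature distance playing the role that time lag usually plays.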

## Best Answer

You don't need to omit an individual who is missing only a few observations. In fact, you want to include participants with missingness to increase your power and avoid biasing your results. The nice thing about mixed-effects models is that they handle missing data pretty well under maximum likelihood estimation, especially in the context of longitudinal designs.

After taking a look at the syntax below, you'll notice that the estimates from the full model and the missingness model are fairly similar, given the extremely small sample size. Additionally, if you specify a random slope, you can also extract empirical Bayes estimates using the ranef() function, which gives you an estimated slope for each participant.

These are calculated using both information from the individual and information from the rest of the sample. In the case of more extreme observations or individuals with smaller sample sizes (due to missing data), estimates will be adjusted toward the mean of the overall sample, which is a concept known as "shrinkage." There is a pretty good review on growth curves in a mixed-effects framework that can be found here, although the author uses the nlme package rather than lme4.
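A simplified numerical sketch of shrinkage (not lme4's actual computation: the variance components are assumed known and the grand mean is taken as the unweighted mean of subject means) shows how subjects with fewer observations are pulled more strongly toward the overall mean:

```python
import numpy as np

# Hypothetical subject means, per-subject sample sizes, and variance components.
subject_means = np.array([22.0, 18.0, 20.5, 25.0])
n_obs = np.array([4, 2, 4, 1])         # subjects 2 and 4 have missing observations
tau2, sigma2 = 1.0, 2.0                # between-subject and residual variances (assumed)

grand_mean = subject_means.mean()

# BLUP-style shrinkage factor: closer to 0 (more shrinkage) when n is small,
# because a subject observed fewer times contributes less reliable information.
lam = tau2 / (tau2 + sigma2 / n_obs)
blup = grand_mean + lam * (subject_means - grand_mean)
```

Subject 4, with a single observation, has the smallest `lam` and so is pulled furthest toward the grand mean, which is exactly the behaviour described above for individuals with more missing data.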