Which type of Mixed Model is appropriate and which random effects to use with R lme()

bioinformaticsexperiment-designlme4-nlmemixed modelr

I have collected a data-set and run into some problems on how to correctly use a linear mixed model to analyse my data.
The data collection (simplified): I have 24 family groups (“family”) of 2 individuals each that were treated with A, B, C or D (“treat1”, e.g. group 1-6 was treated with A, group 7-12 was treated with B etc.; total n = 48). Then, I separated all individuals and one of the two individuals per family was treated with E, while the other was not as a control (“treat2”). As dependent variable, I measured various genes’ expression levels (which I log-transformed subsequently).
The main question I'd like to address is: does “treat1” (A,B,C,D) affect the individuals’ response (gene expression) to “treat2” (E)?
Assuming log-transformed gene expression is normally distributed, I’d use the following model for each gene individually (from the nlme package):
lme(gene_expression~treat1*treat2, random = 1|family, method=”REML”)

I conceptually assume that “family” should not affect the interaction of treat1 and treat2, which is why I only have a random intercept here, but I wonder if the model can actually deal with the experimental design, specifically the fact that each family only reflects one level of treat1, but two levels of treat2.
While this model mostly appears to produce reasonable results, I occasionally get a “singular convergence (7)” error message, and I am not sure why that is (the response data looks fine as genes with many 0 expression levels are not considered).

It would be wonderful to get somebody’s opinion on this, specifically if the random intercept is fine or maybe a random slope has to be added, and on what might cause the error message. Thank you so much!

Best Answer

With 16000 genes you will be better off using software packages that are designed to handle large-scale gene-expression data. The DESeq2 and edgeR Bioconductor packages were designed for working with RNAseq data, and the venerable limma package originally designed for spotted microarrays can work with such data, too.

Your gene-by-gene analysis does not model variance as a function of gene expression level, unlike the packages cited above. For example, the DESeq2 package performs negative binomial modeling of RNA-seq counts as a function of both sample-specific and gene-specific factors. That provides better pooled error estimates of variances to use for differential expression analysis than does the constant variance on a log scale implicit in your approach, potentially improving power to detect true changes. Those packages can help deal with outliers and handle the multiple-comparisons problem. They accept design matrices in ways that should accommodate your experimental design.

Model matrix example

The main question I'd like to address is: does “treat1” (A,B,C,D) affect the individuals’ response (gene expression) to “treat2” (E)?

Section 3.5 of the edgeR Users Guide has a design addressing essentially the same question as yours. Each individual Patient, having one of 3 Disease types, received both a control and a hormone Treatment. The question there is whether the Disease affects the response to Treatment, like your interest in whether "treat1" affects the response to "treat2"; it has pairing like yours.

To get a corresponding design matrix for your study, replace the User Guide's 9-level Patient with your 24-level "family"; its 3-level Disease with your 4-level "treat1"; its 2-level Treatment with your 2-level "treat2":

fam <- factor(rep(1:24,each=2))
trt1 <-factor(rep(LETTERS[1:4],each=12))
trt2 <- relevel(factor(rep(c("None","E"),24)),ref="None")
AE <- trt1=="A" & trt2 =="E"
BE <- trt1=="B" & trt2 =="E"
CE <- trt1=="C" & trt2 =="E"
DE <- trt1=="D" & trt2 =="E"
design <- model.matrix(~fam)
design <- cbind(design,AE,BE,CE,DE)

That accomplishes with pairing the critical part of what your mixed model was doing. That section of the User Guide then shows how to use the resulting model to find genes that respond differentially to combinations of conditions.

The section of the DESeq2 vignette on "Group-specific condition effects, individuals nested within groups" suggests a more efficient model-matrix coding inheriting from that edgeR method. As the particular "family" names aren't themselves important and no "family" is involved in more than 1 level of "treat1", you can set up an "ind.n" factor that just annotates the 6 separate families within each level of "treat1." Then your model matrix could be based on the formula ~treat1 + treat1:ind.id + treat1:treat2. The vignette goes on to illustrate how to get comparisons of interest.

I haven't carefully thought through the differences between those two suggestions. The point is that these standard packages should be able to answer your fundamental question.

A comment on an earlier version of this answer suggests that you might have additional covariates. If so, the edgeR-recommended model matrix only has 28 columns and the DESeq2 recommendation only has 12, allowing you to add columns for some additional covariates (if they aren't linearly dependent on the columns already included).

If you do need to use a mixed model, you might need to consider a two-step approach as in Trabzuni et al., Bioinformatics, Volume 30, Issue 11, 1 June 2014, Pages 1555–1561, which combined mixed modeling with subsequent finite mixture modeling to separate differentially expressed from non-differential transcripts.