Solved – How to fit a linear mixed effects model to the data using lme4 in R

lme4-nlmer

My data consists of 790 columns of normalized metabolite expression values (some collected at a first and second-time point) for approximately 100 subjects ( so 200 gene expression values corresponding to 790 different metabolites altogether).

I am trying to fit a linear mixed effect model to my data, and I am having some medication kept as fixed effects which are known to influence the metabolite levels (BB, WA, ACOG), and also age and adiposity.

My predictor here will be my disease status, and the response will be metabolite levels. Here is the formula I have tried so far ( although I am coming up against some error messages), therefore I don't think I have my response variable in the correct format. Please could anybody suggest a correct format?

        fit <- lmer (data1[, 1:790] ~ data1$Diseasestatus + 
         + data1$BB + data1$WA + data1$ACOG + data1$Age + data1$Adiposity ,REML = F)  

Error messages include;

  Error in model.frame.default(drop.unused.levels = TRUE, formula = data1[, 1:790] ~  : 
     invalid type (list) for variable 'data1'

Just using one particular metabolite as the response i.e data1[, 1] got the error message of

  Error in KhatriRao(sm, t(mm)) : (p <- ncol(X)) == ncol(Y) is not TRUE

I am sorry if this is a naive question. No examples I have found thus far actually show the format of the response variable data. Any suggestions would be greatly appreciated. Thank-you

Edit; I aim to compare the 790 metabolites between the control and inflammatory bowel disease groups ( through looking at the effect size and also p values). I have found some code that I have changed slightly, but I am still not able to do see any output ( once I have converted the data into the correct format). I ultimately would like a table with one column metabolites, one column effect size values and one column p values. I am not getting this unfortunately;

   fit1 <- lmer(expr_val ~ Diseasestatus + BB + WA + ACOG + Age + Adiposity + 
  (1|participantID), data = data_long)

   compare <- lmer(expr_val ~ BB + WA + ACOG + Age + Adiposity + 
                   (1|participantID), data = data_long)


   test<-anova(fit, compare)
   out<-summary(fit)$coef
   res <- c(out[2,1],out[2,2],a$"Pr(>Chisq)[2],summary(fit1)$devcomp$dims[1])

I have included some data too (for the first 20 participants (ID I have changed) and the first 10 metabolites, in case this helps. 0 for medication columns means the participant is not taking the drug, 1 they are taking it. Similarly for disease status 0 is control, 1 is disease. BMI values are used here too rather than adiposity measurements.

dput(data) =

 structure(list(Participant_ID = c(34L, 35L, 119L, 157L, 158L, 
 208L, 209L, 1364L, 1365L, 127911L, 127912L, 154110L, 154120L, 
 167113L, 167123L, 171713L, 171724L, 184212L, 184213L), BB = c(0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 
 0L, 1L), ACOG = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L), WA = c(0L, 0L, 0L, 0L, 0L, 0L, 
 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), BMI = c(23.94688606, 
 25.87052536, 26.38413048, 24.10971069, 27.77280045, 24.93728065, 
 26.8804493, 23.90113258, 25.07429123, 27.60118484, 23.12600708, 
 26.39195442, 23.01516533, 31.3666172, 31.80447578, 24.03654861, 
 25.11828613, 24.17065239, 28.48728561), Age = c(76L, 76L, 68L, 
 68L, 68L, 57L, 57L, 56L, 56L, 60L, 60L, 44L, 44L, 58L, 58L, 71L, 
 71L, 56L, 56L), Diseasestatus = c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 
 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L), Met1 = c(0.326537646, 
 0.362137501, 0.403331692, 0.343789581, 0.437786804, 0.720648545, 
 0.974583105, 0.565800103, 0.613001417, 0.547743467, 0.337683125, 
 0.393250468, 0.465795971, 0.390206584, 0.172261362, 0.382496277, 
 0.435237338, 0.945312001, 0.321214419), Met2 = c(0.465736593, 
 0.540715637, 0.472693123, 0.681156674, 0.416291697, 0.487306504, 
 0.499092007, 0.634904337, 0.408109505, 0.808546214, 0.4113336, 
 0.924069141, 0.673204104, 0.693500596, 0.522794352, 0.373602067, 
 0.716407827, 0.649634492, 0.514429127), Met3 = c(0.902854296, 
 0.413241218, 0.418436978, 0.599698582, 0.806269489, 0.746859677, 
 0.461750237, 0.534943022, 0.511101841, 0.339406025, 0.235624644, 
 0.405761674, 0.312947287, 0.409833325, 0.026137354, 0.477175654, 
 0.387610389, 0.226427797, 0.19742037), Met4 = c(0.99425024, 0.923934731, 
 0.804677487, 0.31081605, 0.351561982, 0.529615606, 0.756342125, 
 0.968115646, 0.989016517, 0.938703504, 0.841777433, 0.103150219, 
 0.68397041, 0.903129097, 0.897388285, 0.905293975, 0.992337012, 
 0.358619626, 0.159601445), Met5 = c(0.527268407, 0.646332723, 
 0.646042578, 0.163344212, 0.202267074, 0.536976636, 0.789061409, 
 0.725657854, 0.697350164, 0.044081822, 0.959496477, 0.295039796, 
 0.120109301, 0.160817478, 0.901107461, 0.529179518, 0.573373775, 
 0.560701172, 0.325806613), Met6 = c(0.809497068, 0.614253411, 
 0.375421856, 0.446069992, 0.710859888, 0.474587655, 0.217817798, 
 0.464787031, 0.5540375, 0.62822217, 0.082906217, 0.294754096, 
 0.862216149, 0.427856328, 0.418944666, 0.516181576, 0.544516281, 
 0.519113772, 0.279522811), Met7 = c(0.419627992, 0.365954584, 
 0.434398151, 0.313441811, 0.368051981, 0.660614914, 0.825809828, 
 0.412109302, 0.545740249, 0.326247449, 0.373035298, 0.380623499, 
 0.428859232, 0.321044089, 0.24939936, 0.298372835, 0.387467105, 
 0.906034877, 0.147250125), Met8 = c(0.549683979, 0.347795497, 
 0.465729386, 0.625045713, 0.551784129, 0.348174756, 0.4334509, 
 0.594903245, 0.561353241, 0.621274979, 0.231389704, 0.308801446, 
 0.464799907, 0.401663011, 0.332966555, 0.109698561, 0.184359915, 
 0.091447702, 0.20568595), Met9 = c(0.605266628, 0.316564583, 
 0.166558136, 0.337470002, 0.458328756, 0.409329111, 0.269424154, 
 0.514746553, 0.408357879, 0.572246814, 0.264718681, 0.125162297, 
 0.211230627, 0.655667116, 0.034006203, 0.189685846, 0.243832622, 
 0.360657636, 0.259174139), Met10 = c(0.576174353, 0.214361265, 
 0.523133504, 0.549085457, 0.430400583, 0.53943429, 0.441563681, 
 0.401805576, 0.386025835, 0.514017513, 0, 0.330305736, 0.567380079, 
 0.50505895, 0.242814909, 0.306522744, 0.132950297, 0.207312191, 
 0.328760686)), class = "data.frame", row.names = c(NA, -19L))

Best Answer

You should have one row per measurement per individual. So, if for each of 100 individuals you took 790 measurements twice, you should have a dataset that is 100*790*2 rows long. Your columns should be the participant ID, the metabolite type, and your other variables. Because each participant is measured multiple times, participant-level characteristics like age will be duplicated across rows (i.e., you will have multiple rows with the same age because they all correspond to the same individual).

To get the data in this format, you'll need to turn your dataset from wide to long. There are a variety of tools in R to do that, but I think the melt() function in the data.table package is fairly easy to use. Your code will look something like the following, assuming the only variables in your dataset are the 790 metabolite columns, participant ID, and the variables you've included in the model:

data_long <- melt(data, measure.vars = 1:790, variable.name = "met_type",
                  value.name = "expr_val")

This long dataset (which I've called data_long) should have a different structure from your current dataset. Rather than a column for each type of metabolite, there will be one column for the metabolite expression value (expr_val) and another for the type of metabolite (met_type). You should also have a column, which I've called participantID in the model, that identifies each participant (which you probably already have).

From here, how you specify the model will depend on your theories about the 790 metabolites. If each type of metabolite is important, you can add metabolite as a fixed, measurement-level covariate, which will be split into 789 dummy variables. If the specific type of the 790 metabolites is not important to your theory but needs to be accounted for, you can add a random effect for metabolite, which gives your model two random effects (one for participant and one for metabolite type). The model, with two random effects, should look like the following:

fit <- lmerTest::lmer(expr_val ~ Diseasestatus + BB + WA + ACOG + Age + Adiposity + 
                       (1|participantID) + (1|met_type), data = data_long)

If you run summary() on the fit object, you will get the coefficient estimates and estimates of the variance components. If you want to get p-values to test whether the coefficient of Diseasestatus differs from zero, you need to use the lmerTest package. The coefficient on Diseasestatus represents the difference between the average metabolite expression value for those with the disease and that for those without.


EDIT:

It looks like you want an estimate of the effect for each metabolite. It makes sense to run a separate model for each metabolite, though after doing so you should correct for multiple comparisons. You need to loop through each metabolite and fit a separate model to each one. You could do this with either the long or the wide dataset. We'll stick with the wide dataset since it's smaller.

fit_list <- setNames(lapply(paste0("Met", 1:790), function(i) {
    fit <- lmer(data_wide[[i]] ~ Diseasestatus + BB + WA + ACOG + 
                   Age + Adiposity + (1|participantID), 
                data = data_wide)
    summary(fit)$coef["Diseasestatus", c("Estimate", "Pr(>|t|)")]
}), paste0("Met", 1:790))

dplyr::bind_rows(fit_list, .id = "Metabolite")

To run this model you must have the lmerTest package loaded. You didn't see any p-values previously because you didn't have it loaded before running the model. The p-values from these tests use the Satterthwaite degrees of freedom.

Related Question