Generalized linear model with log link using log transformed fixed/random effects

generalized linear model, lognormal distribution, mixed model, predictive-models

I am modelling a longitudinal dataset consisting of a continuous response variable (mutation count) with a binary predictor (medical history, i.e. previous medications), while accounting for time and each individual with fixed effects.

My aim is to measure the extent to which the health measures can predict mutation prevalence. Because the mutation counts are not normally distributed, I am using a generalized linear mixed model. A related question (linked below) notes the problems with using a log-transformed response variable. What issues should I be aware of if a log-transformed random/fixed effect (in my case, time) is used instead? I ask only because log-transforming time gives significantly better model fit values without affecting the distribution of the residuals.

My model in lme4 is:

glmer(Mutations ~ Medication + (1 | TimeLog) + (1 | Sample), 
      data = inputdata, family = Gamma(link = "log"))

Linear model with log-transformed response vs. generalized linear model with log link

Best Answer

I sense two areas of confusion here.

One is the logarithmic data transformation of predictor variables (like mapping Time to TimeLog) versus the logarithmic link function used in the generalized linear model. The former has to do with the predictor variables, the latter with the response variable and its relationship to the linear part of the model.

In ordinary least-squares linear regression, it is standard practice to transform predictor variables as necessary to meet desirable characteristics like linearity, constant variance of the residuals between predictions and observed outcome values, and so on. So a log transform of time (as a predictor variable) might be called for regardless of the type of linear model you are pursuing. The linear regression provides, for any case of interest, a single linear predictor that is a linear combination of all the (potentially transformed) predictor-variable values for that case.

A generalized linear model allows such linear modeling of outcome variables that might not be adequately handled without further transformation of a linear predictor, which in principle could provide predicted values over all of $(-\infty,\infty)$. The link function in a generalized linear model has to do with mapping between the linear predictor and the response variable; it doesn't directly care whether the original predictor variables were somehow transformed before they were combined into the overall linear predictor. So from that perspective you don't have to worry.
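The distinction can be illustrated with a minimal sketch on simulated data. Everything here (the variable names `y` and `Time`, the sample size, the parameter values) is an assumption made for illustration, not the asker's data. It uses base R's `glm()` without random effects, only to show that the link function stays the same whether or not a predictor is log-transformed.

```r
# Simulated data; all names and parameter values are illustrative assumptions.
set.seed(1)
n <- 200
Time <- runif(n, min = 1, max = 50)

# True mean follows a power law in Time: E[y] = exp(0.5) * Time^0.8
mu <- exp(0.5 + 0.8 * log(Time))
y <- rgamma(n, shape = 5, rate = 5 / mu)  # Gamma response with mean mu

# Predictor transformation: log(Time) is just another column in the
# linear predictor. Link function: Gamma(link = "log") models log(E[y]).
fit_raw <- glm(y ~ Time,      family = Gamma(link = "log"))
fit_log <- glm(y ~ log(Time), family = Gamma(link = "log"))

# Same response, same family, same link: only the predictor differs.
AIC(fit_raw, fit_log)

# Predictions on the response scale stay positive in both cases,
# because the inverse link exp() maps the linear predictor to (0, Inf).
range(predict(fit_raw, type = "response"))
range(predict(fit_log, type = "response"))
```

Because the response and family are identical in the two fits, comparing them by AIC is legitimate, which would not be the case if the response itself had been transformed.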

The second area of confusion is in your formulation of the generalized linear mixed model. As Isabella Ghement and Dimitris Rizopoulos have both mentioned, there are two problems here. First, unless you are dealing with such large numbers of mutations that they effectively have a continuous distribution, count data should be modeled as count data with Poisson or negative-binomial generalized linear models. Second, the way you have treated your time variable as a random effect (you say "fixed effects" in the question but you evidently meant "random effects" from the formulation of your model) would only rarely make sense. Please make sure that you fully understand the implications of treating time as a random effect in the way that you have, as others have noted. Did you perhaps intend to treat time as a fixed effect but with a different slope versus time for different individuals? If so, please consult the lmer cheat sheet for the correct way to code that.
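As a rough sketch of the first point, count responses can be fitted in lme4 with a Poisson family, or with `glmer.nb()` when the counts are overdispersed. The data below are simulated; the column names mirror the question, but the sample sizes, effect sizes, and the choice to put Time in as a fixed effect are assumptions for illustration only.

```r
library(lme4)

# Simulated longitudinal counts; every name and parameter value is an assumption.
set.seed(42)
n_samples <- 30
d <- expand.grid(Sample = factor(seq_len(n_samples)), Time = 0:5)
id <- as.integer(d$Sample)

med <- rbinom(n_samples, 1, 0.5)               # one medication status per individual
d$Medication <- factor(med[id], levels = c(0, 1), labels = c("no", "yes"))

b0 <- rnorm(n_samples, sd = 0.4)               # individual-level random intercepts
eta <- 1 + 0.3 * (d$Medication == "yes") + 0.15 * d$Time + b0[id]
d$Mutations <- rnbinom(nrow(d), mu = exp(eta), size = 2)  # overdispersed counts

# Poisson GLMM: variance equals the mean, conditional on the random effects.
fit_pois <- glmer(Mutations ~ Medication + Time + (1 | Sample),
                  data = d, family = poisson(link = "log"))

# Negative-binomial GLMM: estimates an extra dispersion parameter.
fit_nb <- glmer.nb(Mutations ~ Medication + Time + (1 | Sample), data = d)

AIC(fit_pois, fit_nb)
```

On real data, which of the two is adequate depends on how much overdispersion remains after the random effects are accounted for.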

In response to comment:

The best way to capture a change of Mutations with Time is to include Time as a fixed effect. (Including Time, however transformed, as a random effect as in your model doesn't accomplish that in any useful way that I see.) The regression coefficient for Time then gives a direct measure of the rate of increase of Mutations with Time. (For simplicity, I'm assuming Mutations to increase linearly over Time, and ignoring for now the link function of the generalized model.) Your model doesn't presently include a fixed effect for Time in any way.
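Once the log link is put back in, the interpretation of the Time coefficient changes from additive to multiplicative. As a sketch, writing $\beta_0$ and $\beta_1$ for the intercept and the Time coefficient (symbols introduced here for illustration) and holding the random effects fixed:

$$\log \mathrm{E}[\text{Mutations}] = \beta_0 + \beta_1\,\text{Time} \quad\Longrightarrow\quad \mathrm{E}[\text{Mutations}] = e^{\beta_0}\, e^{\beta_1 \text{Time}},$$

so each unit increase in Time multiplies the expected number of Mutations by $e^{\beta_1}$. If Time enters log-transformed instead,

$$\log \mathrm{E}[\text{Mutations}] = \beta_0 + \beta_1 \log(\text{Time}) \quad\Longrightarrow\quad \mathrm{E}[\text{Mutations}] = e^{\beta_0}\,\text{Time}^{\beta_1},$$

which is a power law in Time, and the intercept then refers to Time=1 rather than Time=0.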

If you think that Medication will affect the rate of increase of Mutations with Time, as opposed to simply affecting the number of Mutations at Time=0, then you also need to include an interaction term between the two fixed effects of Medication and Time. The intercept of the model (under default R handling) is then the value of Mutations at Time=0 for whatever Medication you have specified as the reference category.
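A minimal sketch of such a model in lme4 follows. The data are simulated and the Poisson family, level labels, and effect sizes are all assumptions; only the formula structure (`Medication * Time` expanding to both main effects plus their interaction) is the point.

```r
library(lme4)

# Simulated data; names, sizes, and coefficients are illustrative assumptions.
set.seed(7)
n_samples <- 40
d <- expand.grid(Sample = factor(seq_len(n_samples)), Time = 0:4)
id <- as.integer(d$Sample)

med <- rep(c(0, 1), length.out = n_samples)    # alternate medication status
d$Medication <- factor(med[id], levels = c(0, 1), labels = c("none", "treated"))
treated <- as.numeric(d$Medication == "treated")

b0 <- rnorm(n_samples, sd = 0.3)               # random intercept per Sample
eta <- 1 + 0.2 * treated + 0.10 * d$Time + 0.15 * treated * d$Time + b0[id]
d$Mutations <- rpois(nrow(d), exp(eta))

# Medication * Time = Medication + Time + Medication:Time
fit_int <- glmer(Mutations ~ Medication * Time + (1 | Sample),
                 data = d, family = poisson(link = "log"))

fixef(fit_int)       # coefficients on the log scale
exp(fixef(fit_int))  # multiplicative effects on the expected count
```

With "none" as the reference level, the `Time` coefficient is the log-scale slope for untreated individuals, and `Medicationtreated:Time` is the difference in slope for treated individuals.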

Your (1|Sample) term then allows that intercept to differ among Samples. For the rate of change of Mutations also to differ among Samples (beyond any effects due to Medication differences among samples), add a term involving (Time|Sample). That's precisely how the web page you linked in your comment allowed Time to contribute to a random effect term even though it is a fixed effect. This answer on the lmer cheat sheet shows how to specify such a term depending on the assumptions that you are willing to make.
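The two common ways of writing a random slope for Time can be sketched as follows, again on simulated data with assumed names and parameter values. Medication is left out of this sketch to keep it short; in a full model the fixed part would carry it as well.

```r
library(lme4)

# Simulated data with individual-specific intercepts and slopes (assumed values).
set.seed(11)
n_samples <- 40
d <- expand.grid(Sample = factor(seq_len(n_samples)), Time = 0:5)
id <- as.integer(d$Sample)

b0 <- rnorm(n_samples, sd = 0.4)   # random intercept deviations
b1 <- rnorm(n_samples, sd = 0.1)   # random slope deviations
d$Mutations <- rpois(nrow(d), exp(1 + b0[id] + (0.15 + b1[id]) * d$Time))

# Correlated random intercept and slope: (Time | Sample) is (1 + Time | Sample).
fit_corr <- glmer(Mutations ~ Time + (Time | Sample),
                  data = d, family = poisson(link = "log"))

# Uncorrelated random intercept and slope: two separate terms.
fit_uncorr <- glmer(Mutations ~ Time + (1 | Sample) + (0 + Time | Sample),
                    data = d, family = poisson(link = "log"))

VarCorr(fit_corr)    # one 2 x 2 covariance block for Sample
VarCorr(fit_uncorr)  # two separate variance components
```

In both versions Time remains a fixed effect in the main formula; the random term only lets each Sample deviate from that average slope.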
