Solved – Linear model with log-transformed response vs. generalized linear model with log link

generalized linear model, lognormal distribution, model selection

In this paper titled "CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA" the authors write:

In a generalized linear model, the mean is transformed, by the link
function, instead of transforming the response itself. The two methods
of transformation can lead to quite different results; for example,
the mean of log-transformed responses is not the same as the logarithm
of the mean response. In general, the former cannot easily be
transformed to a mean response. Thus, transforming the mean often
allows the results to be more easily interpreted, especially in that
mean parameters remain on the same scale as the measured responses.

It appears they advise the fitting of a generalized linear model (GLM) with log link instead of a linear model (LM) with log-transformed response. I do not grasp the advantages of this approach, and it seems quite unusual to me.

My response variable looks log-normally distributed. I get similar results in terms of the coefficients and their standard errors with either approach.

Still I wonder: If a variable has a log-normal distribution, isn't the mean of the log-transformed variable preferable over the log of the mean untransformed variable, as the mean is the natural summary of a normal distribution, and the log-transformed variable is normally distributed, whereas the variable itself is not?

Best Answer

Although it may appear that the mean of the log-transformed variables is preferable (since this is how log-normal is typically parameterised), from a practical point of view, the log of the mean is typically much more useful.

This is particularly true when your model is not exactly correct. To quote George Box: "All models are wrong, but some are useful."

Suppose some quantity is log-normally distributed, blood pressure say (I'm not a medic!), and we have two populations, men and women. One might hypothesise that the average blood pressure is higher in women than in men. This corresponds exactly to asking whether the log of average blood pressure is higher in women than in men. It is not the same as asking whether the average of log blood pressure is higher in women than in men.

Don't get confused by the textbook parameterisation of a distribution - it doesn't have any "real" meaning. The log-normal distribution is parameterised by the mean of the log ($\mu_{\ln}$) and the variance of the log ($\sigma^2_{\ln}$) because of mathematical convenience, but equally we could choose to parameterise it by its actual mean and variance:

$\mu = e^{\mu_{\ln} + \sigma_{\ln}^2/2}$

$\sigma^2 = (e^{\sigma^2_{\ln}} -1)e^{2 \mu_{\ln} + \sigma_{\ln}^2}$

Obviously, doing so makes the algebra horribly complicated, but it still works and means the same thing.
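As a minimal sketch of that claim, the two formulas can be written as a pair of functions together with their inverse, so either parameterisation can be recovered from the other. The function names `lognormal_mean_var` and `lognormal_log_params` and the example values are my own choices for illustration.

```python
import math

def lognormal_mean_var(mu_ln, sigma_ln):
    """Actual mean and variance of a log-normal from its log-scale parameters."""
    mean = math.exp(mu_ln + sigma_ln**2 / 2)
    var = (math.exp(sigma_ln**2) - 1) * math.exp(2 * mu_ln + sigma_ln**2)
    return mean, var

def lognormal_log_params(mean, var):
    """Inverse map: log-scale parameters from the actual mean and variance."""
    # var / mean^2 = exp(sigma_ln^2) - 1, so solve for sigma_ln^2 first.
    sigma2_ln = math.log(1 + var / mean**2)
    mu_ln = math.log(mean) - sigma2_ln / 2
    return mu_ln, math.sqrt(sigma2_ln)

# Illustrative values only: mu_ln = 0, sigma_ln = 1.
mean, var = lognormal_mean_var(0.0, 1.0)
mu_back, sigma_back = lognormal_log_params(mean, var)
print(mean, var)
print(mu_back, sigma_back)
```

The round trip returns the original log-scale parameters (up to floating-point error), which is all "it still works and means the same thing" amounts to.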

Looking at the above formulas, we can see an important difference between transforming the variables and transforming the mean. The log of the mean, $\ln(\mu)$, increases as $\sigma^2_{\ln}$ increases, while the mean of the log, $\mu_{\ln}$, doesn't.
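To make that step explicit, take logs of the formula for $\mu$ (nothing new here, just the same symbols rearranged):

$\ln(\mu) = \mu_{\ln} + \sigma_{\ln}^2/2$

So for a fixed $\mu_{\ln}$, the log of the mean goes up by half of any increase in $\sigma^2_{\ln}$.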

This means that women could, on average, have higher blood pressure than men, even though the mean parameter of the log-normal distribution ($\mu_{\ln}$) is the same, simply because the variance parameter is larger. This fact would get missed by a test that used log(Blood Pressure).
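A small simulation illustrates the point. This is only a sketch: the parameter values (`mu_ln = 4.8`, log-scale standard deviations of 0.1 and 0.3) are made up for illustration and are not real blood pressure data.

```python
import math
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Hypothetical log-scale parameters: same mu_ln, different sigma_ln.
mu_ln = 4.8
sigma_ln_men, sigma_ln_women = 0.1, 0.3

men = [rng.lognormvariate(mu_ln, sigma_ln_men) for _ in range(n)]
women = [rng.lognormvariate(mu_ln, sigma_ln_women) for _ in range(n)]

# Mean of the log: what a linear model on log(y) compares between groups.
mean_log_men = fmean(math.log(y) for y in men)
mean_log_women = fmean(math.log(y) for y in women)

# Log of the mean: what a log-link GLM compares between groups.
log_mean_men = math.log(fmean(men))
log_mean_women = math.log(fmean(women))

print("difference in mean of log:", mean_log_women - mean_log_men)
print("difference in log of mean:", log_mean_women - log_mean_men)
```

With a single group indicator as the only predictor, the linear model on the log response estimates the first difference and the log-link GLM estimates the second, so the two approaches can disagree even when both are fitted correctly.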

So far, we have assumed that blood pressure genuinely is log-normal. If the true distributions are not quite log-normal, then transforming the data will (typically) make things even worse than above, since we won't quite know what our "mean" parameter actually means. That is, we won't know whether the two equations for the mean and variance given above are correct. Using them to transform back and forth will then introduce additional errors.
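Here is a rough sketch of that last problem, assuming (purely for illustration) that the data are actually gamma distributed with shape 2 and scale 1, which is right-skewed but not log-normal. We estimate $\mu_{\ln}$ and $\sigma^2_{\ln}$ from the logged data and push them through the log-normal formula for the mean.

```python
import math
import random
from statistics import fmean

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: gamma(shape=2, scale=1), so the true mean is 2.
sample = [rng.gammavariate(2.0, 1.0) for _ in range(200_000)]
logs = [math.log(y) for y in sample]

# Log-scale estimates, as one would get after transforming the response.
mu_ln_hat = fmean(logs)
sigma2_ln_hat = fmean((x - mu_ln_hat) ** 2 for x in logs)

# Back-transform with the log-normal mean formula, which is wrong for gamma data.
back_transformed_mean = math.exp(mu_ln_hat + sigma2_ln_hat / 2)
sample_mean = fmean(sample)

print("sample mean:          ", sample_mean)
print("back-transformed mean:", back_transformed_mean)
```

In this setup the back-transformed value should come out noticeably above the sample mean, even with a very large sample, because the error comes from the wrong distributional assumption rather than from sampling noise.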