Solved – Difference between log-normal distribution and logging variables, fitting normal

lognormal-distribution, mixed-model, normal-distribution, r

Context: I have a set of data that is bimodal, so I used the mixtools package in R to fit a bimodal normal distribution to it. It looked as if the normal did not fit very well, and given other similar sets of data I have (that are not bimodal) and are log-normally distributed, I figured log-normal would probably make more sense. However, mixtools does not have a way to fit a bimodal log-normal distribution, so I took the log of the data and refit. The result is shown in the image below, and the fit is to my satisfaction.
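
For reference, here is a minimal sketch of what that refit looks like with mixtools (toy values, not my actual data):

library(mixtools)

set.seed(1)
# hypothetical bimodal data whose components are roughly log-normal
x <- c(rlnorm(500, meanlog = 1, sdlog = 0.3),
       rlnorm(500, meanlog = 3, sdlog = 0.3))

fit <- normalmixEM(log(x), k = 2)  # two-component normal mixture on the log scale
fit$lambda  # mixing proportions
fit$mu      # component means (log scale)
fit$sigma   # component sds (log scale)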

Really, this is kind of a software issue (not knowing what package could explicitly fit a bimodal log-normal), but it has made me think about something I have wondered often: is fitting a normal distribution to logged data equivalent to fitting a log-normal distribution to the original data? I suspect not, but I am not sure why. (Ignoring cases with zeros or negative data, and assuming all data are fairly large positive values, though skewed.) I also know that you would of course need to back-transform to get estimates on the original scale.

I tried to test this out with some toy data and realized I don't even know why the meanlog of a fitted log-normal distribution is NOT what you get when you take the mean of the logged data. So maybe my understanding is already broken when it comes to the log-normal's parameters.

library(fitdistrplus)

set.seed(1)
test <- rnorm(1000, mean = 100)  # all values are far above 0
test[test <= 0] <- NA            # unnecessary since no values are <= 0, but just to demonstrate
test <- na.omit(test)

log.test <- log10(test)
mean(log.test)
sd(log.test)
# 1.999926 is the mean of log.test
# 0.004496153 is the sd of log.test

fitdist(log.test, dist = "lnorm", method = "mle")
# However, "meanlog" is 0.693107737 and "sdlog" is 0.002247176
# The means are so different; not sure why?

[Image: bimodal normal mixture fit to the log-transformed data]

Best Answer

Reference

$$ x \sim \log \mathcal{N}(\mu, \sigma^2) \\ \text{if} \\ p(x) = \frac{1}{x \sqrt{2\pi} \sigma} e^{- \frac{\left( \log(x) - \mu\right)^2}{2\sigma^2}}, \quad x > 0 $$

where $$ \text{E}[x] = e^{\mu + \frac{1}{2}\sigma^2}. $$

Note that $$ y \sim \log \mathcal{N}(m, v^2) \iff \log(y) \sim \mathcal{N}(m, v^2), $$

per this Q&A.

Answer

is fitting a normal distribution to logged data equivalent to fitting a log-normal distribution to the original data?

Theoretically? In most situations, yes (see the logical equivalence above). The only case I have found where it was useful to work with the log-normal distribution explicitly was a case study of pollution data. In that instance, it was important to model weekdays and weekends differently in terms of pollution concentration ($\mu_1 > \mu_2$ in the prior*), while leaving the expected values of the two log-normal distributions unrestricted (I had to allow $e^{\mu_1 + \frac{1}{2}\sigma_1^2} \le e^{\mu_2 + \frac{1}{2}\sigma_2^2}$). Which day each measurement was taken on was unknown, so the separate parameters had to be inferred.
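
As a quick numerical check of the equivalence (toy data; the values below are assumed, not from the question), the maximum-likelihood meanlog/sdlog from fitting a log-normal to x match the mean/sd from fitting a normal to log(x):

library(fitdistrplus)

set.seed(1)
x <- rlnorm(1000, meanlog = 4.6, sdlog = 0.5)  # positive, skewed toy data

fitdist(x, distr = "lnorm", method = "mle")$estimate      # meanlog, sdlog
fitdist(log(x), distr = "norm", method = "mle")$estimate  # mean, sd of log(x): essentially identical
# note: "lnorm" is parameterized with natural logs, so compare against log(), not log10()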

You could certainly argue that this could be done without invoking the log-normal distribution, but this is what we decided to use and it worked.

I tried to test this out with some toy data and realized I don't even know why the meanlog associated with a log-normal distribution is NOT what you get when you take the mean of the logged normal distribution.

The reason for this is just a consequence of our notion of distance on the support. Since $\log$ is a monotone increasing function, log-transforming variables preserves order. For example, the median of the log-normal distribution is just $e^\mu$, the exponential of the median of the log-values (since the mean of a normal distribution is also its median).

However, the $\log$ function only preserves order, not the distance function itself. Means are all about distance: the mean is the point that, with points weighted by their probabilities, minimizes the sum of squared Euclidean distances to all other points. The $\log$ transform compresses values in an uneven way (i.e., larger values are compressed more). In fact, the log of the mean of the log-normal distribution is higher than the mean of the log-values (i.e., $\mu$) by $\frac{1}{2}\sigma^2$: $$ \log \left(e^{\mu + \frac{1}{2} \sigma^2} \right) = \mu + \frac{1}{2} \sigma^2 > \mu. $$ That is, the mean of the log-values is pulled down relative to the log of the mean by an amount that depends on the spread of the distribution (through $\sigma$), because the $\log$ function compresses distances unevenly.
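
A small numerical illustration of this gap (the $\mu$ and $\sigma$ values below are arbitrary):

set.seed(1)
mu <- 2; sigma <- 0.8
x <- rlnorm(1e6, meanlog = mu, sdlog = sigma)

mean(log(x))           # ~ mu             = 2    (mean of the log-values)
log(mean(x))           # ~ mu + sigma^2/2 = 2.32 (log of the mean)
exp(mu + sigma^2 / 2)  # analytic E[x], matching mean(x)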


*As a side note, these kinds of artificial constraints in priors tend to under-perform other methods for inferring/separating distributions.
