I have a sample of data that follows a lognormal distribution. I would like to represent the distribution as a "Gaussian" histogram with an overlaid fit (along a logarithmic x-axis) instead of a lognormal representation. For simplicity, I'll call the average and sigma of the lognormal data `mu_log` and `sigma_log`, respectively. It is my (possibly incorrect) understanding that the average of the normal representation should then be `mu_norm = exp(mu_log)`, and the sigma of the normal representation should be `sigma_norm = exp(sigma_log)`. In order to make the histogram follow a Gaussian shape, I can take the log of every value in my data; I'll call it `data_log`, which becomes `data_norm` when normalized to one.
Q1) I am performing a Chi Square analysis to find the optimized mu and sigma that produce the best fitting curve to my histogram. When counting observed values per bin and computing expectation values per bin, do I use the original lognormal data or data_norm? Does it necessarily matter? For the expectation values, should I integrate the formula given for a normal distribution or a lognormal distribution?
Q2) Will performing the Chi Square analysis produce the mu and sigma for a histogram fit of my lognormal data or a histogram fit of data_norm (or could it be either, depending on Q1)?
Q3) Once I have the parameters mu and sigma that give a Gaussian shape to my histogram and data overlay, I need the normalization constant (let's call it `c_norm`). In the case of a plain Gaussian, `c_norm = 1 / ((2 * pi)^0.5 * sigma)`. But in the case of a lognormal distribution, `c_norm = 1 / (x * (2 * pi)^0.5 * sigma)`. I am guessing that I use sigma and `c_norm` from the normal distribution to find the normalization constant.
PS: I am asking because I have tried repeatedly and failed. As seen in this picture, I was able to fit a curve to a normal distribution (left), but my Gaussian fit for a lognormal distribution (right) does not look correct. I can post/message my Python code for that plot, but it is a bit lengthy.
Best Answer
By definition, a random variable $Z$ has a Lognormal distribution when $\log Z$ has a Normal distribution. This means there are numbers $\sigma\gt 0$ and $\mu$ for which the density function of $X = (\log(Z) - \mu)/\sigma$ is
$$\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$
The density of $Z$ itself is obtained by substituting $(\log(z)-\mu)/\sigma$ for $x$ in the density element $\phi(x)\mathrm{d}x$:
$$\eqalign{ f(z;\mu,\sigma)\mathrm{d}z &= \phi\left(\frac{\log(z) - \mu}{\sigma}\right)\mathrm{d}\left(\frac{\log(z) - \mu}{\sigma}\right) \\ &=\frac{1}{z\,\sigma}\phi\left(\frac{\log(z) - \mu}{\sigma}\right)\mathrm{d}z. }$$
For $z \gt 0$, this is the PDF of a Normal$(\mu,\sigma)$ distribution applied to $\log(z)$, but divided by $z$. That division resulted from the (nonlinear) effect of the logarithm on $\mathrm{d}z$: namely, $$\mathrm{d}\log z = \frac{1}{z}\mathrm{d}z.$$
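The same change of variables, run in the other direction, shows what to plot along a logarithmic axis. Writing $u = \log z$ (so $z = e^u$ and $\mathrm{d}z = e^u\,\mathrm{d}u$; the symbols $u$ and $g$ are introduced here only for illustration), the factor $1/z$ cancels and the density of $\log Z$ is the plain Normal one:

$$\eqalign{ g(u;\mu,\sigma)\,\mathrm{d}u &= f(e^u;\mu,\sigma)\,e^u\,\mathrm{d}u \\ &= \frac{1}{e^u\,\sigma}\phi\left(\frac{u - \mu}{\sigma}\right)e^u\,\mathrm{d}u \\ &= \frac{1}{\sigma}\phi\left(\frac{u - \mu}{\sigma}\right)\mathrm{d}u. }$$

So a histogram of the logged data takes the Gaussian normalization $1/(\sqrt{2\pi}\,\sigma)$, while a histogram of the original data takes $1/(z\sqrt{2\pi}\,\sigma)$, with the same $\mu$ and $\sigma$ in both cases.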
Apply this to fitting your data: estimate $\mu$ and $\sigma$ by fitting a Normal distribution to the logarithms of the data and plug them into $f$. It's that simple.
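A minimal Python sketch of this recipe, using only the standard library. It assumes simulated data in place of yours; the names `lognormal_pdf`, `mu_hat`, `sigma_hat`, and the simulation parameters are illustrative and are not taken from the R code mentioned in this answer.

```python
import math
import random
import statistics


def lognormal_pdf(z, mu, sigma):
    """Lognormal density f(z; mu, sigma): Normal PDF of log(z), divided by z."""
    if z <= 0:
        return 0.0
    x = (math.log(z) - mu) / sigma
    return math.exp(-0.5 * x * x) / (z * sigma * math.sqrt(2.0 * math.pi))


# Simulated stand-in for the real data (the true parameters are assumptions).
rng = random.Random(17)
data = [rng.lognormvariate(1.0, 0.5) for _ in range(200)]

# Fit a Normal distribution to the logarithms of the data.
logs = [math.log(z) for z in data]
mu_hat = statistics.fmean(logs)     # mean of the logs
sigma_hat = statistics.stdev(logs)  # sample standard deviation of the logs

# Curve to overlay on a density-scaled histogram of the ORIGINAL data.
grid = [0.05 * k for k in range(1, 401)]  # z from 0.05 to 20
curve = [lognormal_pdf(z, mu_hat, sigma_hat) for z in grid]
```

The curve is already normalized to unit area, so it lines up with a histogram scaled to density (not raw counts) without any further constant.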
As an example, here is a histogram of $200$ values drawn independently from a Lognormal distribution. On it is plotted, in red, the graph of $f(z;\hat\mu,\hat\sigma)$ where $\hat \mu$ is the mean of the logs and $\hat \sigma$ is the estimated standard deviation of the logs.
You might like to study the (simple) R code that produced these data and the plot.

This analysis appears to have addressed all the questions. Because it isn't clear what you mean by a "Chi Square analysis," let me finish with a warning: if you mean to compute a chi-squared statistic from a histogram of the data and obtain a p-value from it using a chi-squared distribution, then there are many pitfalls to beware. Read and study the account at https://stats.stackexchange.com/a/17148/919 and especially note the need to (a) establish the bin cutpoints independently of the data and (b) estimate $\mu$ and $\sigma$ by means of Maximum Likelihood based on the bin counts alone (rather than the actual data).
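For point (b), here is a rough Python sketch of Maximum Likelihood from bin counts alone, assuming a multinomial likelihood over fixed bins. The cutpoints, the simulated data, and the helper names (`bin_probs`, `binned_loglik`, `fit_binned`) are all illustrative assumptions, and the crude coordinate search stands in for whatever optimizer you prefer.

```python
import bisect
import math
import random


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bin_probs(cuts, mu, sigma):
    """Lognormal probabilities of the bins (0, c1], (c1, c2], ..., (ck, inf)."""
    cdf = [0.0] + [normal_cdf(math.log(c), mu, sigma) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def binned_loglik(counts, cuts, mu, sigma):
    """Multinomial log-likelihood of the bin counts (constant term dropped)."""
    total = 0.0
    for n, p in zip(counts, bin_probs(cuts, mu, sigma)):
        if n > 0:
            if p <= 0.0:
                return -math.inf
            total += n * math.log(p)
    return total


def fit_binned(counts, cuts, mu=0.0, sigma=1.0, step=1.0, rounds=80):
    """Crude coordinate search for the (mu, sigma) maximizing binned_loglik."""
    best = binned_loglik(counts, cuts, mu, sigma)
    for _ in range(rounds):
        improved = False
        for d_mu, d_sigma in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            new_mu, new_sigma = mu + d_mu, sigma + d_sigma
            if new_sigma <= 0.0:
                continue
            ll = binned_loglik(counts, cuts, new_mu, new_sigma)
            if ll > best:
                best, mu, sigma = ll, new_mu, new_sigma
                improved = True
        if not improved:
            step /= 2.0  # no move helped: refine the search scale
    return mu, sigma


# Cutpoints chosen in advance, without looking at the data (illustrative values).
cuts = [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]

# Simulated stand-in for real data; only the bin counts are used for fitting.
rng = random.Random(17)
data = [rng.lognormvariate(1.0, 0.5) for _ in range(2000)]
counts = [0] * (len(cuts) + 1)
for z in data:
    counts[bisect.bisect_left(cuts, z)] += 1

mu_hat, sigma_hat = fit_binned(counts, cuts)

# Expected counts per bin under the fitted lognormal.
expected = [len(data) * p for p in bin_probs(cuts, mu_hat, sigma_hat)]
```

The expected counts come from differences of the CDF across each bin, that is, from integrating the density over the bin. The integral gives the same value whether it is written for the lognormal density on the original scale or for the Normal density on the log scale.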