Solved – Fitting a lognormal distribution to binned or grouped data

Tags: lognormal distribution, maximum likelihood

I understand the maximum likelihood estimators for $\mu$ and $\sigma$ of the lognormal distribution when the actual data values are available: the MLE of $\mu$ is $\hat\mu = \frac{1}{n}\sum_{i=1}^n \log X_i$, and the MLE of $\sigma^2$ is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left(\log X_i - \hat\mu\right)^2$. However, I need to understand how these formulas are modified when the data are already grouped or binned (and the actual values are not available). Suppose the data fall in bins with cutpoints $b_1, b_2, b_3$, and so on, where $b_1$ to $b_2$ is the first bin, $b_2$ to $b_3$ the second bin, and so on. What are the modified estimators of $\mu$ and $\sigma^2$? Thank you.

Best Answer

Let $\Phi$ be the standard normal cumulative distribution function. The probability that a value $Y$ drawn from a lognormal distribution with log mean $\mu$ and log SD $\sigma$ lies in the interval $(b_i, b_{i+1}]$ is

$$\Pr(b_i \lt Y \le b_{i+1}) = \Phi \left( \frac{\log(b_{i+1}) - \mu}{\sigma} \right) - \Phi \left( \frac{\log(b_{i}) - \mu}{\sigma} \right).$$

Call this value $f_i(\mu, \sigma)$.
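As a concrete check, this probability needs nothing more than the standard error function. The snippet below is a Python sketch (the bin $(2, 4]$ and the parameters $\mu = 0$, $\sigma = 1$ are chosen purely for illustration) that evaluates $f_i$ directly:

```python
from math import erf, log, sqrt

def Phi(z):
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bin_prob(lo, hi, mu, sigma):
    """Pr(lo < Y <= hi) for Y ~ Lognormal(mu, sigma).

    A left edge of 0 maps to -infinity on the log scale, so it contributes 0.
    """
    upper = Phi((log(hi) - mu) / sigma)
    lower = Phi((log(lo) - mu) / sigma) if lo > 0 else 0.0
    return upper - lower

p = bin_prob(2, 4, 0.0, 1.0)  # Phi(log 4) - Phi(log 2), roughly 0.16
```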

When the data consist of independent draws $Y_1,Y_2, \ldots, Y_N$, with $Y_i$ falling in bin $j(i)$ and the bin cutpoints are established independently of the $Y_i$, the probabilities multiply, whence the log likelihood is the sum of the logs of these values:

$$\log(\Lambda(\mu, \sigma)) = \sum_{i=1}^{N} \log(f_{j(i)}(\mu, \sigma)).$$

It suffices to count the number of $Y_i$ falling within each bin $j$; let this count be $k(j)$. Collecting, for each bin $j$, the $k(j)$ identical terms it contributes, the sum condenses to

$$\log(\Lambda(\mu, \sigma)) = \sum_{j} k(j) \log(f_{j}(\mu, \sigma)).$$
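The collapse from a per-observation sum to a count-weighted sum is pure bookkeeping; a minimal Python check (with made-up bin labels and bin probabilities, just to illustrate the identity) confirms the two forms agree:

```python
from math import log
from collections import Counter

# Hypothetical bin label j(i) for each observation Y_i
labels = [0, 0, 1, 0, 2, 1, 0]
# Hypothetical bin probabilities f_j evaluated at some fixed (mu, sigma)
f = {0: 0.5, 1: 0.3, 2: 0.2}

per_obs = sum(log(f[j]) for j in labels)               # sum over i of log f_{j(i)}
counts = Counter(labels)                               # k(j)
collapsed = sum(k * log(f[j]) for j, k in counts.items())

assert abs(per_obs - collapsed) < 1e-12
```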

The MLEs are the values $\hat{\mu}$ and $\hat{\sigma}$ that together maximize $\log(\Lambda(\mu, \sigma))$. There is no closed formula for them in general: numerical solutions are needed.

Example

Consider data values known only to lie within intervals of width $2$: $[0,2]$, $[2,4]$, and so on. I randomly generated 100 of them according to a Lognormal$(0,1)$ distribution. In Mathematica this can be done via

With[{width = 2},
  data = width {Floor[#/width], Floor[#/width] + 1} & /@  
    RandomReal[LogNormalDistribution[0, 1], 100]
];

Here are their tallies:

Interval  Count
 [0, 2]     77
 [2, 4]     16
 [4, 6]      5
 [6, 8]      1
[16, 18]     1

Finding the MLE for data like this requires two procedures. The first computes the log likelihood from the list of all 100 intervals:

logLikelihood[data_, m_, s_] := 
With[{f = CDF[LogNormalDistribution[m, s], #] &},
  Sum[Log[f[b[[2]]] - f[b[[1]]]], {b, data}]
];

Second, one to numerically maximize the log likelihood:

mle = NMaximize[{logLikelihood[data, m, s], s > 0}, {m, s}]

The solution reported by Mathematica is

{-77.0669, {m -> -0.014176, s -> 0.952739}}

The first value in the list is the log likelihood and the second (evidently) gives the MLEs of $\mu$ and $\sigma$, respectively. They are comfortably close to their true values.

Other software systems will vary in their syntax, but typically they will work in the same way: one procedure to compute the probabilities and another to maximize the log likelihood determined by those probabilities.
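For instance, here is a Python sketch of the same fit using only the standard library. It rebuilds the count-weighted log likelihood from the tallies above and maximizes it with a coarse grid search, which stands in for a proper optimizer such as Mathematica's `NMaximize` (the grid bounds and step are arbitrary choices for this example):

```python
from math import erf, log, sqrt

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def bin_prob(lo, hi, mu, sigma):
    # Pr(lo < Y <= hi) for Y ~ Lognormal(mu, sigma);
    # a left edge of 0 maps to -infinity on the log scale, contributing 0
    upper = Phi((log(hi) - mu) / sigma)
    lower = Phi((log(lo) - mu) / sigma) if lo > 0 else 0.0
    return upper - lower

# Tallies from the example above: (bin, count)
tallies = [((0, 2), 77), ((2, 4), 16), ((4, 6), 5), ((6, 8), 1), ((16, 18), 1)]

def log_likelihood(mu, sigma):
    # sum over bins j of k(j) * log f_j(mu, sigma);
    # a tiny floor guards against log(0) at extreme parameter values
    return sum(k * log(max(bin_prob(lo, hi, mu, sigma), 1e-300))
               for (lo, hi), k in tallies)

# Coarse grid search over (mu, sigma) in [-1, 1] x [0.3, 2], step 0.02
best = max(((log_likelihood(m / 100, s / 100), m / 100, s / 100)
            for m in range(-100, 101, 2)
            for s in range(30, 201, 2)),
           key=lambda t: t[0])
ll, mu_hat, sigma_hat = best
```

On these tallies the grid search should land close to Mathematica's $\hat\mu \approx -0.014$ and $\hat\sigma \approx 0.953$, up to the grid resolution; refining the grid (or switching to a real optimizer) recovers the answer to full precision.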