Solved – What are the dangers of using a (log)normal distribution for a discrete response

discrete data, normal distribution

I have seen some papers in the engineering field (one example) that use normal or lognormal distributions to model discrete outcomes.
Typically, the explanatory variable is binned (into equal intervals) so that each bin yields an empirical probability of belonging to a given outcome (i.e., for each bin, the number of points in the bin divided by the total).
The set of probabilities obtained is then assumed to follow a normal (lognormal) distribution as follows:

$$
P(Y) = \Phi\!\left(\frac{(\log) X - \mu}{\sigma}\right),
$$

so

$$
(\log) X = \sigma\,\Phi^{-1}\!\bigl(P(Y)\bigr) + \mu,
$$

which allows them to estimate the distribution parameters by OLS regression (the slope estimates $\sigma$ and the intercept estimates $\mu$).
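
To make the procedure concrete, here is a minimal sketch in Python on synthetic data. The sample size, number of bins, parameter values, and the use of numpy/scipy are my own illustrative assumptions, not taken from the paper.

```python
# Sketch of the binning + probit + OLS procedure described above (synthetic data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.5, size=500)       # explanatory variable
p_true = stats.norm.cdf((np.log(x) - 1.0) / 0.5)       # true P(Y) under the assumed model
y = (rng.random(500) < p_true).astype(int)             # membership in the outcome of interest

# 1. Bin X into equal intervals and compute the empirical probability per bin.
edges = np.linspace(x.min(), x.max(), 11)
mids = 0.5 * (edges[:-1] + edges[1:])
idx = np.digitize(x, edges[1:-1])
p_hat = np.array([y[idx == k].mean() if np.any(idx == k) else np.nan
                  for k in range(len(mids))])

# 2. Drop bins with empirical probability 0 or 1 (the step the question objects to).
keep = (p_hat > 0) & (p_hat < 1)

# 3. OLS of log(X) on the probit-transformed probabilities:
#    log X = sigma * Phi^{-1}(P) + mu, so slope ~ sigma, intercept ~ mu.
z = stats.norm.ppf(p_hat[keep])
sigma_hat, mu_hat = np.polyfit(z, np.log(mids[keep]), 1)
print("mu:", round(mu_hat, 2), "sigma:", round(sigma_hat, 2))
```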

Because the inverse of the normal distribution function diverges at probabilities of 0 and 1, they also have to discard all points with such probabilities, which obviously leaves much of the data unused; that is one issue. Another is that discrete outcomes should be modelled with appropriate distributions, i.e. the binomial and multinomial.
But I cannot get my head around the other shortcomings of using such a method for discrete outcomes; I feel there must be many.
Can anyone provide more insight?

Best Answer

I wouldn't rule out the lognormal as an approximation for discrete positive variables. Last time I checked, the populations of the countries of the world fit a lognormal distribution quite well, and population is naturally discrete.

But judging from a glance at the example paper you cite, those researchers are not fitting to discrete variables at all. The applications are to inundation depth, current velocity and hydrodynamic force, which are all continuous, so that looks uncontroversial in principle. How well the lognormal fits in practice is a different question. In the broad field you mention, environmental hazards, I have a paper under review that shows that the lognormal fits some continuous data far better than the power law distribution suggested by some previous workers. I suspect that is common.

How best to fit the lognormal is a different and key question. Fitting by binning and least squares is a fairly lousy method, and there are better ones. For example, the easiest is just to take logarithms of the raw data, calculate the mean and standard deviation of the resulting (approximately normal) values, and exponentiate where results are needed on the original scale. There are many other ways to do it, but binning just discards detail in the data and raises small and indeed large questions about how far the fit depends on arbitrary decisions about bin width (and sometimes bin origin).
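
As a concrete illustration of that simpler route, here is a minimal sketch on synthetic data (the parameter values and the use of numpy are assumptions for the example, not part of the original answer):

```python
# Fit a lognormal by taking logs of the raw data: no binning, no discarded points.
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=2.0, sigma=0.7, size=1000)   # stand-in for raw continuous data

logs = np.log(data)
mu_hat = logs.mean()              # estimate of mu on the log scale
sigma_hat = logs.std(ddof=1)      # estimate of sigma on the log scale
median_hat = np.exp(mu_hat)       # exponentiate to report on the original scale (median)
print(mu_hat, sigma_hat, median_hat)
```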

I've seen elsewhere the practice of ignoring points with associated cumulative probability 0 or 1, and (as you imply) omitting data just because a quantile function cannot be evaluated is utterly indefensible. The use of plotting positions such as (rank − 0.5) / sample size has been the standard way to avoid such difficulties for more than a century. (Trimming extreme values to impart some resistance would be a quite different, and possibly defensible, practice.)
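
Here is a brief sketch of how such plotting positions keep every observation in play, on synthetic data; the closing probability-plot OLS fit is just one common way they are used, not something prescribed above.

```python
# Plotting positions (rank - 0.5) / n are strictly between 0 and 1,
# so the normal quantile function can be evaluated for every observation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = np.sort(rng.lognormal(mean=1.5, sigma=0.4, size=200))

n = len(data)
p = (np.arange(1, n + 1) - 0.5) / n     # Hazen plotting positions
z = stats.norm.ppf(p)                   # corresponding standard normal quantiles

# OLS of log(data) on z: slope ~ sigma, intercept ~ mu (a probability-plot fit).
sigma_hat, mu_hat = np.polyfit(z, np.log(data), 1)
print(mu_hat, sigma_hat)
```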

I think you are confusing variables that are discrete in principle with the binning of continuous variables as a matter of supposed convenience or necessity. Binning continues to have some uses (e.g. histograms remain popular), but using and plotting all the data as they arrive was always better in principle and is now rarely difficult in computational practice.
