Lognormal Distribution – MLE/Likelihood of Lognormally Distributed Interval

interval-censoring, likelihood, lognormal-distribution, maximum-likelihood, r

I have a set of responses that are expressed as intervals, such as the sample below.

> head(left)
[1]  860  516  430 1118  860  602
> head(right)
[1]  946  602  516 1204  946  688

where left is the lower bound and right is the upper bound of each response. I want to estimate the parameters of a lognormal distribution for these data.

For a while, when I was trying to calculate the likelihoods directly, I struggled with the fact that the two bounds were fitted with different sets of parameters, which gave me some negative values like these:

> Pr_high=plnorm(wta_high,meanlog_high,sdlog_high)
> Pr_low=plnorm(wta_low, meanlog_low,sdlog_low)
> Pr=Pr_high-Pr_low
> 
> head(Pr)
[1] -0.0079951419  0.0001207749  0.0008002343 -0.0009705125 -0.0079951419 -0.0022395514

I couldn't figure out how to resolve this, so I decided to use the mid-point of each interval instead, which seemed a reasonable compromise, until I found the mledist function, which fits a distribution to interval responses by maximum likelihood. This is the summary I get:

> mledist(int, distr="lnorm")
$estimate
meanlog     sdlog 
6.9092257 0.3120138 

$convergence
[1] 0

$loglik
[1] -152.1236

$hessian
         meanlog       sdlog
meanlog 570.760358    7.183723
sdlog     7.183723 1112.098031

$optim.function
[1] "optim"

$fix.arg
NULL

Warning messages:
1: In plnorm(q = c(946L, 602L, 516L, 1204L, 946L, 688L, 1376L, 1376L,  :
NaNs produced
2: In plnorm(q = c(860L, 516L, 430L, 1118L, 860L, 602L, 1290L, 1290L,  :
NaNs produced

The parameter values seem to make sense, and the log-likelihood is greater than with any other method I have tried (fitting the mid-points, or fitting either one of the bounds).

There is a warning message that I don't understand, so could anyone tell me whether I am doing the right thing and what this message means?

Appreciate the help!

Best Answer

It sounds like you might not be computing the likelihood correctly.

When all you know about a value $x$ is that

  1. It is obtained independently from a distribution $F_\theta$ and

  2. It lies between $a$ and $b \gt a$ inclusive (where $b$ and $a$ are independent of $x$),

then (by definition) its likelihood is $${\Pr}_{F_\theta}(a \le x \le b) = F_\theta(b) - F_\theta(a).$$ The likelihood of a set of independent observations therefore is the product of such expressions, one per observation. The log-likelihood, as usual, will be the sum of logarithms of those expressions.
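
In symbols, writing $(a_i, b_i)$ for the $i$th interval, that is

$$\Lambda(\theta) = \sum_{i=1}^{n} \log\big(F_\theta(b_i) - F_\theta(a_i)\big),$$

which is exactly what the function below computes.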


As an example, here is an R implementation where the values of $a$ are in the vector left, the values of $b$ in the vector right, and $F_\theta$ is Lognormal. (This is not a general-purpose solution; in particular, it assumes that $b \gt a$ strictly for all the data, so that every interval has positive length.)

#
# Lognormal log-likelihood for interval data.
#
lambda <- function(mu, sigma, left, right) {
  sum(log(pnorm(log(right), mu, sigma) - pnorm(log(left), mu, sigma)))
}
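
As a quick sanity check, you can evaluate this at the mledist estimates on the six intervals displayed in the question (just the head() of the full data, so this is not the full log-likelihood):

left  <- c(860, 516, 430, 1118, 860, 602)
right <- c(946, 602, 516, 1204, 946, 688)
lambda(6.9092257, 0.3120138, left, right)  # log-likelihood contribution of these six intervals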

To find the maximum log likelihood we need a reasonable set of starting values for the log mean $\mu$ and log standard deviation $\sigma$. This estimate replaces each interval by the geometric mean of its endpoints:

#
# Create an initial estimate of lognormal parameters for interval data.
#
lambda.init <- function(left, right) {
  mid <- log(left * right)/2
  c(mean(mid), sd(mid))
}
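
(Because $(\log a + \log b)/2 = \log\sqrt{ab}$, mid is just the midpoint of each interval on the log scale, so the starting values are the sample mean and standard deviation of those log-midpoints.)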

Let's generate some random lognormally distributed data and bin them into intervals:

set.seed(17)
n <- 12                     # Number of data
z <- exp(rnorm(n, 6, .5))   # meanlog = 6, sdlog = 0.5
left <- 100 * floor(z/100)  # Bin into multiples of 100
right <- left + 100

The fitting can be performed by a general-purpose multivariate optimizer. (This one is a minimizer by default, so it must be applied to the negative of the log-likelihood.)

fit <- optim(lambda.init(left,right), 
             fn=function(theta) -lambda(theta[1], theta[2], left, right))
fit$par

6.1188785 0.3957045
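
If you also want approximate standard errors to go with these point estimates, analogous to the Hessian that mledist reported, one option (a sketch using optim's built-in hessian argument) is to invert the Hessian of the negative log-likelihood at the optimum:

fit <- optim(lambda.init(left, right), 
             fn=function(theta) -lambda(theta[1], theta[2], left, right),
             hessian=TRUE)
sqrt(diag(solve(fit$hessian)))  # approximate standard errors of (mu, sigma)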

The estimate of $\mu$ is $6.12$, not far from the intended value of $6$, and the estimate of $\sigma$ is $0.40$, not far from the intended value of $0.5$: not bad for just $12$ values. To see how good the fit is, let's plot the empirical cumulative distribution function and the fitted distribution function. To construct the ECDF, I just interpolate linearly through each interval:

#
# ECDF of the data.
#
F <- function(x) (1 + mean((abs(x - left) - abs(x - right)) / (right - left)))/2

y <- sapply(x <- seq(min(left) * 0.8, max(right) / 0.8, 1), F)
plot(x, y, type="l", lwd=2, lty=2, ylab="Cumulative probability")
curve(pnorm(log(x), fit$par[1], fit$par[2]), from=min(x), to=max(x), col="Red", lwd=2, 
  add=TRUE)

[Figure: empirical CDF of the interval data (dashed) with the fitted lognormal CDF overlaid in red.]

Because the vertical deviations are consistently small and vary both up and down, it looks like a good fit.
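
Finally, about the warning in the question: plnorm returns NaN (with a "NaNs produced" warning) when it is given an invalid parameter such as a negative sdlog, and a general-purpose optimizer like the optim call used by mledist can try such values during its search. When convergence is 0, the warning is usually harmless, but you can avoid it entirely by reparameterizing. A sketch, optimizing over $\log \sigma$ so that $\sigma$ is always positive:

start <- lambda.init(left, right)
fit2 <- optim(c(start[1], log(start[2])),
              fn=function(theta) -lambda(theta[1], exp(theta[2]), left, right))
c(meanlog=fit2$par[1], sdlog=exp(fit2$par[2]))  # back-transform sigma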
