I have a set of responses, each expressed as an interval, such as the sample below.
> head(left)
[1] 860 516 430 1118 860 602
> head(right)
[1] 946 602 516 1204 946 688
where `left` holds the lower bounds and `right` the upper bounds of the responses. I want to estimate the parameters of a lognormal distribution from these interval data.
For a while, when I was trying to calculate the likelihoods directly, I was struggling with the fact that, since the two bounds were evaluated with different sets of parameters, I was getting some negative values like below:
> Pr_high=plnorm(wta_high,meanlog_high,sdlog_high)
> Pr_low=plnorm(wta_low, meanlog_low,sdlog_low)
> Pr=Pr_high-Pr_low
>
> head(Pr)
[1] -0.0079951419 0.0001207749 0.0008002343 -0.0009705125 -0.0079951419 -0.0022395514
I couldn't really figure out how to resolve this, so I used the mid-point of each interval instead, which seemed a reasonable compromise, until I found the `mledist` function (from the fitdistrplus package), which maximizes the log-likelihood for interval responses. This is the summary I get:
> mledist(int, distr="lnorm")
$estimate
meanlog sdlog
6.9092257 0.3120138
$convergence
[1] 0
$loglik
[1] -152.1236
$hessian
meanlog sdlog
meanlog 570.760358 7.183723
sdlog 7.183723 1112.098031
$optim.function
[1] "optim"
$fix.arg
NULL
Warning messages:
1: In plnorm(q = c(946L, 602L, 516L, 1204L, 946L, 688L, 1376L, 1376L, :
NaNs produced
2: In plnorm(q = c(860L, 516L, 430L, 1118L, 860L, 602L, 1290L, 1290L, :
NaNs produced
The parameter values seem to make sense, and the log-likelihood is greater than with any other method I have tried (fitting the mid-points or either one of the bounds alone).
There is a warning message which I don't understand, so could anyone tell me whether I am doing the right thing and what the message means?
Appreciate the help!
Best Answer
It sounds like you might not be computing the likelihood correctly.
When all you know about a value $x$ is that
1. it is obtained independently from a distribution $F_\theta$, and
2. it lies between $a$ and $b \gt a$ inclusive (where $b$ and $a$ are independent of $x$),
then (by definition) its likelihood is $${\Pr}_{F_\theta}(a \le x \le b) = F_\theta(b) - F_\theta(a).$$ The likelihood of a set of independent observations therefore is the product of such expressions, one per observation. The log-likelihood, as usual, will be the sum of logarithms of those expressions.
As an example, here is an R implementation where the values of $a$ are in the vector `left`, the values of $b$ are in the vector `right`, and $F_\theta$ is Lognormal. (This is not a general-purpose solution; in particular, it assumes that $b \gt a$ for all the data.)

To find the maximum log-likelihood we need a reasonable set of starting values for the log mean $\mu$ and log standard deviation $\sigma$. This estimate replaces each interval by the geometric mean of its endpoints:
Let's generate some random lognormally distributed data and bin them into intervals:
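For instance (the true parameters $\mu = 6$, $\sigma = 0.5$ and the sample size of 12 come from the text below; the seed and the bin width of 100 are arbitrary illustrative choices):

```r
set.seed(17)                                # arbitrary seed for reproducibility
mu <- 6; sigma <- 0.5; n <- 12              # true parameters and sample size
x <- rlnorm(n, meanlog = mu, sdlog = sigma) # lognormal draws
delta <- 100                                # bin width (an illustrative choice)
left  <- delta * floor(x / delta)           # lower endpoints
right <- left + delta                       # upper endpoints
cbind(left, right)                          # the binned intervals
```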
The fitting can be performed by a general-purpose multivariate optimizer. (This one is a minimizer by default, so it must be applied to the negative of the log-likelihood.)
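One way to carry out this step with `optim` (a self-contained sketch; the seed and bin width are my own assumptions, so the resulting estimates will not match the quoted figures exactly):

```r
# Simulate interval data (seed and bin width are illustrative choices).
set.seed(17)
x <- rlnorm(12, meanlog = 6, sdlog = 0.5)
delta <- 100
left  <- delta * floor(x / delta)
right <- left + delta

# Negative log-likelihood, because optim minimizes by default.
nll <- function(theta) {
  if (theta[2] <= 0) return(Inf)
  -sum(log(plnorm(right, theta[1], theta[2]) - plnorm(left, theta[1], theta[2])))
}

# Starting values from the geometric means of the endpoints
# (guarding against a zero lower endpoint).
g <- sqrt(pmax(left, delta / 2) * right)
fit <- optim(c(mean(log(g)), sd(log(g))), nll)
fit$par          # estimates of meanlog and sdlog
fit$convergence  # 0 indicates success
```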
The estimate of $\mu$ is $6.12$, not far from the intended value of $6$, and the estimate of $\sigma$ is $0.40$, not far from the intended value of $0.5$: not bad for just $12$ values. To see how good the fit is, let's plot the empirical cumulative distribution function and the fitted distribution function. To construct the ECDF, I just interpolate linearly through each interval:
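A self-contained sketch of that comparison (it regenerates and refits the simulated data so it runs on its own; the seed, bin width, and plotting choices are illustrative assumptions):

```r
# Simulate and refit the binned lognormal data (illustrative settings).
set.seed(17)
x <- rlnorm(12, meanlog = 6, sdlog = 0.5)
delta <- 100
left  <- delta * floor(x / delta)
right <- left + delta
nll <- function(theta) {
  if (theta[2] <= 0) return(Inf)
  -sum(log(plnorm(right, theta[1], theta[2]) - plnorm(left, theta[1], theta[2])))
}
g <- sqrt(pmax(left, delta / 2) * right)
fit <- optim(c(mean(log(g)), sd(log(g))), nll)

# ECDF interpolated linearly through each interval: each observation
# contributes a ramp rising from 0 at its left endpoint to 1/n at its right.
ecdf.interp <- function(q, left, right) {
  sapply(q, function(v) mean(pmin(pmax((v - left) / (right - left), 0), 1)))
}

# Plot the interpolated ECDF against the fitted lognormal CDF.
xs <- seq(min(left), max(right), length.out = 201)
plot(xs, ecdf.interp(xs, left, right), type = "l",
     xlab = "x", ylab = "Probability")
curve(plnorm(x, fit$par[1], fit$par[2]), add = TRUE, col = "red", lwd = 2)
```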
Because the vertical deviations are consistently small and vary both up and down, it looks like a good fit.