Solved – Need algorithm to compute relative likelihood that data are sample from normal vs lognormal distribution

lognormal distributionnormal distribution

Let's say you have a set of values, and you want to know if it is more likely that they were sampled from a Gaussian (normal) distribution or sampled from a lognormal distribution?

Of course, ideally you'd know something about the population or about the sources of experimental error, so would have additional information useful to answering the question. But here, assume we only have a set of numbers and no other information. Which is more likely: sampling from a Gaussian or sampling from a lognormal distribution? How much more likely? What I am hoping for is an algorithm to select between the two models, and hopefully quantify the relative likelihood of each.

Best Answer

You could take a best guess at the distribution type by fitting each distribution (normal or lognormal) to the data by maximum likelihood, then comparing the log-likelihood under each model - the model with the highest log-likelihood being the best fit. For example, in R:

# log likelihood of the data given the parameters (par) for 
# a normal or lognormal distribution
logl <- function(par, x, lognorm=F) {
    if(par[2]<0) { return(-Inf) }
    ifelse(lognorm,
    sum(dlnorm(x,par[1],par[2],log=T)),
    sum(dnorm(x,par[1],par[2],log=T))
    )
}

# estimate parameters of distribution of x by ML 
ml <- function(par, x, ...) {
    optim(par, logl, control=list(fnscale=-1), x=x, ...)
}

# best guess for distribution-type
# use mean,sd of x for starting parameters in ML fit of normal
# use mean,sd of log(x) for starting parameters in ML fit of lognormal
# return name of distribution type with highest log ML
best <- function(x) {
    logl_norm <- ml(c(mean(x), sd(x)), x)$value
        logl_lognorm <- ml(c(mean(log(x)), sd(log(x))), x, lognorm=T)$value
    c("Normal","Lognormal")[which.max(c(logl_norm, logl_lognorm))]
}

Now generate numbers from a normal distribution and fit a normal distribution by ML:

set.seed(1)
x = rnorm(100, 10, 2)
ml(c(10,2), x)

Produces:

$par
[1] 10.218083  1.787379

$value
[1] -199.9697
...

Compare log-likelihood for ML fit of normal and lognormal distributions:

ml(c(10,2), x)$value # -199.9697
    ml(c(2,0.2), x, lognorm=T)$value # -203.1891
best(x) # Normal

Try with a lognormal distribution:

best(rlnorm(100, 2.6, 0.2)) # lognormal

Assignment will not be perfect, depending on n, mean and sd:

> table(replicate(1000, best(rnorm(500, 10, 2))))

Lognormal    Normal 
        6       994 
> table(replicate(1000, best(rlnorm(500, 2.6, 0.2))))

Lognormal    Normal 
      999         1 
Related Question