I'm late to the party, but it seems this question still needs an answer, and, surprisingly to me, I don't see any other questions on this topic. So I'll go ahead and offer this.
To summarize, the QMPE approach is to formulate a likelihood function in terms of the quantiles of the observed data and not the observed data themselves. Then parameters are sought which maximize the likelihood, by differentiating the log-likelihood wrt the parameters and applying a standard optimization method.
It appears that the authors are proposing to ignore the exact values of the data and work only with quantiles. I'm not sure I agree with that, but in any event, the likelihood function as they state it is correct if one only has quantiles at hand, which is sometimes the case.
The likelihood function can be derived by considering the data falling between quantiles as being interval-censored; you know how many data are in each interval, but you don't know where they are. (That isn't how it's motivated by Heathcote et al., but that's what makes the most sense to me.) For each inter-quantile interval, there is a term
$$\big(P(\text{datum in interval} \mid \text{parameters})\big)^{\text{number of data in interval}}$$
Now the number of data in each interval is just $n_i = N\,(p_i - p_{i-1})$, where $N$ is the total number of data and $p_i$ is the proportion corresponding to the $i$-th quantile $q_i$, i.e. $\#\{\text{data} \le q_i\}/N = p_i$, and $P(\text{datum in interval} \mid \text{parameters})$ is just $\mathrm{cdf}(q_i \mid \text{parameters}) - \mathrm{cdf}(q_{i-1} \mid \text{parameters})$, where $\mathrm{cdf}$ is the cumulative distribution function of the assumed distribution. Then the log-likelihood is the summation of
$$N\,(p_i - p_{i-1})\,\log\big(\mathrm{cdf}(q_i \mid \text{parameters}) - \mathrm{cdf}(q_{i-1} \mid \text{parameters})\big)$$
over the inter-quantile intervals indexed by $i$. This is the same for any particular distribution; just plug in the appropriate $\mathrm{cdf}$.
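Here is a minimal sketch of that negative log-likelihood in Python, assuming a normal model as the fitted distribution; any distribution with a cdf could be substituted, and the function and variable names are just illustrative. The tail intervals (below the lowest quantile and above the highest) are included so that all $N$ data contribute.

```python
import numpy as np
from scipy import stats

def quantile_nll(params, quantiles, probs, N):
    """Negative log-likelihood based only on sample quantiles (normal model assumed)."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf                          # keep the scale parameter valid
    # Pad with the extremes so the intervals cover the whole line:
    # (-inf, q_1], (q_1, q_2], ..., (q_k, +inf)
    q = np.concatenate(([-np.inf], quantiles, [np.inf]))
    p = np.concatenate(([0.0], probs, [1.0]))
    n = N * np.diff(p)                         # n_i = N (p_i - p_{i-1})
    cell_prob = np.diff(stats.norm.cdf(q, loc=mu, scale=sigma))
    return -np.sum(n * np.log(cell_prob))      # negative of the log-likelihood above
```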
To find the parameters which maximize the likelihood, given a list of quantiles, any method for maximizing a function can be employed. If $\mathrm{cdf}$ is differentiable, it's probably most convenient to make use of the gradient wrt the parameters. I don't know if anything can be said about the derivatives of $\mathrm{cdf}$ wrt the parameters in general, although maybe if some family is assumed, such as location-scale distributions, some general results can be worked out.
The factor $N$ is constant wrt the parameters, so it's convenient to omit it. It's often more convenient, when working with optimization software, to search for a minimum instead of a maximum, so one can look at the negative log-likelihood instead of the log-likelihood, and minimize that instead of maximizing.
As for the maximization wrt the parameters, Heathcote et al. make use of a particular line-search method, but that is not essential; given the NLL and its gradient, many optimization algorithms are applicable.
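For instance, here is a hedged sketch using scipy.optimize.minimize with a derivative-free method (so no gradient code is needed); the normal model, the simulated data, and all names are illustrative, not part of Heathcote et al.'s procedure.

```python
import numpy as np
from scipy import stats, optimize

def quantile_nll(params, quantiles, probs, N):
    # Compact restatement of the quantile-based NLL sketched earlier.
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    q = np.concatenate(([-np.inf], quantiles, [np.inf]))
    p = np.concatenate(([0.0], probs, [1.0]))
    return -np.sum(N * np.diff(p) * np.log(np.diff(stats.norm.cdf(q, mu, sigma))))

# Simulated example: deciles of a sample from a normal(3, 2) distribution.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=500)
probs = np.arange(0.1, 1.0, 0.1)
quantiles = np.quantile(x, probs)

start = (quantiles[len(quantiles) // 2], (quantiles[-1] - quantiles[0]) / 2.0)  # crude start
result = optimize.minimize(
    quantile_nll, x0=start, args=(quantiles, probs, len(x)),
    method="Nelder-Mead",            # derivative-free; a gradient-based method also works
)
print(result.x)                      # estimated (mu, sigma)
```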
To get a starting place for the parameter search, I think it would be convenient to compute the ordinary complete-data maximum likelihood estimates, assuming that all $n_i$ data in the $i$-th inter-quantile interval fall at the middle of the interval; such an estimate could be conveniently computed if the function to compute ML estimates allows for weighted data, since then the data are the interval midpoints with weights equal to the mass in each interval. Heathcote et al. find a starting point by estimating parameters by matching moments.
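A hedged sketch of that midpoint idea, again for a normal model, where the weighted ML estimates are just the weighted mean and weighted standard deviation of the interval midpoints; the tail intervals have no finite midpoint, so their mass is simply dropped and the weights renormalized. The names are illustrative.

```python
import numpy as np

def midpoint_start(quantiles, probs):
    """Rough (mu, sigma) starting values: put each interval's mass at its midpoint."""
    q = np.asarray(quantiles, dtype=float)
    p = np.asarray(probs, dtype=float)
    midpoints = (q[1:] + q[:-1]) / 2.0       # midpoints of the interior intervals
    weights = np.diff(p)                     # mass n_i / N in each interior interval
    weights = weights / weights.sum()        # renormalize after dropping the tails
    mu = np.sum(weights * midpoints)
    sigma = np.sqrt(np.sum(weights * (midpoints - mu) ** 2))
    return mu, sigma
```

The pair returned could then be handed to the optimizer above as its starting point.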
It has been interesting reading about this topic. Hope this helps someone else understand what's going on.
I'm going to venture an answer to my own question after some googling. One simple approach is to use binned Poisson maximum likelihood ratios. See pp. 94-96 of these lecture notes:
http://www.hep.phy.cam.ac.uk/~thomson/lectures/statistics/FittingHandout.pdf
The likelihood-ratio statistic $-2\ln\lambda$ converges to a $\chi^2$ distribution in the large-count limit, and if you're dealing with very few counts, you should do MC simulations to determine the empirical distribution of the likelihood ratio under the hypothesis that your histogram does indeed represent a collection of samples from the model distribution. You can determine a $p$-value from this simulated likelihood-ratio distribution, and the $\chi^2$ test is just a fast analytical approximation to this which is applicable in the large-count limit.
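As a concrete but hedged sketch of that statistic and the MC recipe, here it is for a fully specified model (if the model's parameters were fitted to the histogram, you would refit them for each simulated histogram and reduce the degrees of freedom accordingly); the function names are just illustrative.

```python
import numpy as np

def poisson_lr_stat(observed, expected):
    """-2 ln(lambda) for Poisson bin counts (the binned likelihood-ratio statistic)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # n * log(n / nu), with the convention 0 * log(0) = 0 for empty bins
    term = np.where(observed > 0,
                    observed * np.log(np.where(observed > 0, observed, 1.0) / expected),
                    0.0)
    return 2.0 * np.sum(expected - observed + term)

def mc_p_value(observed, expected, n_sim=10000, seed=0):
    """Empirical p-value: how often a Poisson-fluctuated histogram with the given
    expectations gives a statistic at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    stat_obs = poisson_lr_stat(observed, expected)
    sims = rng.poisson(expected, size=(n_sim, len(expected)))
    stats_sim = np.array([poisson_lr_stat(s, expected) for s in sims])
    return np.mean(stats_sim >= stat_obs)
```

In the large-count limit the same $p$-value could instead be read off a $\chi^2$ distribution with the appropriate degrees of freedom.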
All of this is nothing more than following the "perennial philosophy" of frequentist statistics:
If you think something interesting happened, you should find out how often you'd expect that thing to happen by chance before you go proclaiming to the world that it's interesting.
If the $p$-value shows that random effects are almost surely not responsible for the difference between your model and your observation, then your model is probably wrong.
Best Answer
One problematic feature is that there may be a continuum of optimal solutions. In most settings the quantiles are continuous functions of the parameters. When the distributions are continuous, almost surely there will be positive intervals between the data values. Suppose your objective function is optimized by a particular parameter value whose quantiles do not coincide exactly with any of the data: that is, they lie in the interiors of the intervals determined by the nearby data values. (This is an extremely likely event.) Then small changes in the parameter value will move the quantiles slightly, to remain within the same intervals, thereby leaving the chi-squared value unchanged because none of the counts changes. Thus the procedure doesn't even pick out a definite set of parameter values!
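Here is a small, hedged numerical illustration of that flatness (the setup, a normal model with known scale and a chi-squared objective built from the counts between the model's deciles, is just for illustration): tiny changes in the location parameter move the quantile cut points but typically leave every bin count, and hence the objective, unchanged.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = np.sort(rng.normal(0.0, 1.0, size=100))
probs = np.arange(0.1, 1.0, 0.1)
expected = len(data) * np.diff(np.concatenate(([0.0], probs, [1.0])))

def chisq_objective(mu, sigma=1.0):
    cuts = stats.norm.ppf(probs, loc=mu, scale=sigma)            # model quantiles
    idx = np.searchsorted(data, cuts, side="right")
    observed = np.diff(np.concatenate(([0], idx, [len(data)])))  # counts per bin
    return np.sum((observed - expected) ** 2 / expected)

for mu in (0.0, 1e-6, 1e-5, 1e-4):
    print(mu, chisq_objective(mu))   # typically prints the same value each time
```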
Another problematic feature is that this procedure apparently provides no way to obtain estimation errors for the parameters.
Another problem is that you do not know even the most basic properties of this estimator, such as its amount of bias.