Maximum-Likelihood – When to Avoid Using R’s nlm Function for MLE: Key Considerations


I've run across a couple of guides suggesting that I use R's nlm for maximum likelihood estimation. But none of them (including R's documentation) gives much theoretical guidance on when to use the function and when not to.

As far as I can tell, nlm is just doing a gradient-based descent along the lines of Newton's method. Are there principles for when it's reasonable to use this approach? What alternatives are available? Also, are there limits on the size of the arrays, etc., that one can pass to nlm?

Best Answer

There are a number of general-purpose optimization routines in base R that I'm aware of: optim, nlminb, nlm and constrOptim (which handles linear inequality constraints, and calls optim under the hood). Here are some things that you might want to consider in choosing which one to use.

  • optim can use a number of different algorithms, including conjugate gradient, quasi-Newton (BFGS and the box-constrained L-BFGS-B), Nelder-Mead and simulated annealing. The last two don't need gradient information, so they can be useful when gradients aren't available or aren't feasible to calculate (but are likely to be slower and to require more parameter fine-tuning, respectively). It also has an option to return the computed Hessian at the solution, which you would need if you want standard errors along with the solution itself (see the sketch after this list).

  • nlminb uses a quasi-Newton algorithm that fills the same niche as the "L-BFGS-B" method in optim. In my experience it seems a bit more robust than optim in that it's more likely to return a solution in marginal cases where optim will fail to converge, although that's likely problem-dependent. It has the nice feature, if you provide an explicit gradient function, of doing a numerical check of its values at the solution. If these values don't match those obtained from numerical differencing, nlminb will give a warning; this helps to ensure you haven't made a mistake in specifying the gradient (easy to do with complicated likelihoods).

  • nlm only uses a Newton algorithm. This can be faster than the other algorithms in the sense of needing fewer iterations to reach convergence, but it has its own drawbacks. It's more sensitive to the shape of the likelihood: if the likelihood is strongly non-quadratic, it may be slower or you may get convergence to a false solution. The Newton algorithm also uses the Hessian, and computing that can be slow enough in practice that it more than cancels out any theoretical speedup.
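
To make the differences concrete, here's a minimal sketch fitting the same likelihood with all three routines. The simulated gamma data, the log-scale parameterization and the starting values are all made up purely for illustration; the point is just to show the calling conventions and how you'd pull standard errors out of optim's Hessian.

```r
## Simulated data, purely for illustration
set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)

## Negative log-likelihood; parameters on the log scale so they stay positive
negll <- function(logpar) {
  shape <- exp(logpar[1])
  rate  <- exp(logpar[2])
  -sum(dgamma(x, shape = shape, rate = rate, log = TRUE))
}

start <- c(0, 0)  # log(shape) = log(rate) = 0, an arbitrary starting point

## optim: quasi-Newton (BFGS), with the Hessian returned for standard errors
fit_optim <- optim(start, negll, method = "BFGS", hessian = TRUE)
se_log <- sqrt(diag(solve(fit_optim$hessian)))  # SEs on the log scale

## nlminb: fills the same niche as optim's "L-BFGS-B"; result is in $par,
## and the objective value at the solution is in $objective
fit_nlminb <- nlminb(start, negll)

## nlm: Newton-type; note it takes the function first and the start second,
## and the result is in $estimate rather than $par
fit_nlm <- nlm(negll, start, hessian = TRUE)

## Back on the natural scale, all three should agree closely here
exp(rbind(optim  = fit_optim$par,
          nlminb = fit_nlminb$par,
          nlm    = fit_nlm$estimate))
```

On a well-behaved likelihood like this one, all three land on essentially the same estimates; the differences show up more in how they behave on awkward likelihoods and in the extras they offer (the Hessian from optim or nlm for standard errors, the gradient check from nlminb).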