These situations are easier to understand if you keep in mind what minimisation or maximisation really is and how optimisation methods work.
Suppose we have a function $f$ with a local minimum at $x_0$. Optimisation methods try to construct a sequence $x_i$ that converges to $x_0$. For some class of functions $f$, each method comes with a theoretical guarantee that the sequence it constructs converges to a point of local minimum.
Obtaining the next candidate at iteration $i$ can be a lengthy process, so practically all algorithms limit the number of iterations. This corresponds to situation 4.
Since $x_0$ is a local minimum, for every $x$ close enough to $x_0$ we have $f(x)\ge f(x_0)$. So if $f(x_i)>f(x_{i-1})$, this is an indication that we have reached the minimum. This corresponds to situation 3.
Now, if $f$ is differentiable at $x_0$, then necessarily $\nabla f(x_0)=0$. The Newton-Raphson method calculates the gradient at each step, so if $\nabla f(x_i)\approx 0$, then $x_i$ is probably a solution. This corresponds to situation 1.
Every convergent sequence of real vectors is a Cauchy sequence and vice versa, which roughly means that if $x_i$ is close to $x_0$ then $x_i$ is close to $x_{i+1}$, and vice versa, where $i$ is the iteration number. So if $|x_i-x_{i-1}|<\varepsilon$, and we know that in theory $x_i$ converges to $x_0$, then we should be close to the minimum point. This corresponds to situation 2.
Convergent sequences also contract: once we are close to the limit, all the remaining elements of the sequence are contained in a small region. So if a sequence that in theory should converge starts taking large steps, this is an indication that it is probably not converging. This corresponds to situation 5.
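To make these checks concrete, here is a minimal gradient-descent sketch in R that applies all five stopping criteria; the objective function, step size and tolerances are arbitrary choices for illustration, not taken from any particular solver.

```r
# Minimal gradient-descent loop illustrating the five stopping situations.
# The objective, step size and tolerances are illustrative choices only.
f    <- function(x) sum((x - c(1, 2))^2)   # toy objective, minimum at (1, 2)
grad <- function(x) 2 * (x - c(1, 2))      # its gradient

x        <- c(10, -10)   # starting point
step     <- 0.1          # fixed step size
eps_grad <- 1e-8         # situation 1: gradient approximately zero
eps_step <- 1e-10        # situation 2: consecutive iterates very close
max_iter <- 1000         # situation 4: cap on the number of iterations
too_big  <- 1e6          # situation 5: a step this large suggests divergence

stopped <- FALSE
for (i in seq_len(max_iter)) {
  g     <- grad(x)
  x_new <- x - step * g
  if (sqrt(sum(g^2)) < eps_grad)           { message("gradient ~ 0 (situation 1)");        stopped <- TRUE; break }
  if (sqrt(sum((x_new - x)^2)) < eps_step) { message("tiny step (situation 2)");           stopped <- TRUE; break }
  if (f(x_new) > f(x))                     { message("objective increased (situation 3)"); stopped <- TRUE; break }
  if (sqrt(sum((x_new - x)^2)) > too_big)  { message("huge step, likely diverging (situation 5)"); stopped <- TRUE; break }
  x <- x_new
}
if (!stopped) message("iteration limit reached (situation 4)")
x   # final iterate, close to c(1, 2) here
```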
Note: Strict mathematical definitions were left out intentionally.
If a prior probability is given as part of the problem setup, then use that information (i.e. use MAP). If no such prior information is given or assumed, then MAP is not possible, and MLE is a reasonable approach.
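As a small illustration in R, contrast the two estimators for a Bernoulli success probability; the counts and the Beta(2, 2) prior below are assumptions made up for the example.

```r
# MLE vs MAP for a Bernoulli success probability (illustrative numbers only).
k <- 7; n <- 10                       # assumed data: 7 successes in 10 trials
mle <- k / n                          # maximises the likelihood alone

a <- 2; b <- 2                        # assumed Beta(2, 2) prior on the probability
map <- (k + a - 1) / (n + a + b - 2)  # mode of the Beta(k + a, n - k + b) posterior

c(MLE = mle, MAP = map)               # 0.70 vs about 0.67: the prior pulls toward 0.5
```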
Best Answer
There are a number of general-purpose optimization routines in base R that I'm aware of: `optim`, `nlminb`, `nlm` and `constrOptim` (which handles linear inequality constraints, and calls `optim` under the hood). Here are some things that you might want to consider in choosing which one to use.

`optim` can use a number of different algorithms including conjugate gradient, Newton, quasi-Newton, Nelder-Mead and simulated annealing. The last two don't need gradient information and so can be useful if gradients aren't available or aren't feasible to calculate (but they are likely to be slower and to require more parameter fine-tuning, respectively). It also has an option to return the computed Hessian at the solution, which you would need if you want standard errors along with the solution itself.
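For example, a minimal sketch of fitting a normal model by maximum likelihood with `optim`, using `hessian = TRUE` to get approximate standard errors; the simulated data and starting values are made up for illustration.

```r
# Sketch: maximum-likelihood fit of a normal mean and log-sd with optim.
# The simulated data and starting values are arbitrary choices.
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)

negloglik <- function(par) {
  mu <- par[1]; sigma <- exp(par[2])   # optimise log-sd so sigma stays positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0), negloglik, method = "BFGS", hessian = TRUE)
fit$par                         # estimates of mu and log(sigma)
sqrt(diag(solve(fit$hessian)))  # approximate standard errors from the Hessian
```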
`nlminb` uses a quasi-Newton algorithm that fills the same niche as the `"L-BFGS-B"` method in `optim`. In my experience it seems a bit more robust than `optim`, in that it's more likely to return a solution in marginal cases where `optim` will fail to converge, although that's likely problem-dependent. It has the nice feature, if you provide an explicit gradient function, of doing a numerical check of its values at the solution. If these values don't match those obtained from numerical differencing, `nlminb` will give a warning; this helps to ensure you haven't made a mistake in specifying the gradient (easy to do with complicated likelihoods).
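Here is a sketch of the equivalent `nlminb` call with an explicit gradient supplied, again on made-up data; the log-sd parameterisation and starting values are assumptions for the example.

```r
# Sketch: nlminb on the same kind of normal model, with an analytic gradient.
# Data, starting values and the log-sd parameterisation are illustrative.
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)

negloglik <- function(par) {
  mu <- par[1]; sigma <- exp(par[2])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
grad_negloglik <- function(par) {
  mu <- par[1]; sigma <- exp(par[2])
  c(-sum(y - mu) / sigma^2,                  # derivative w.r.t. mu
    length(y) - sum((y - mu)^2) / sigma^2)   # derivative w.r.t. log(sigma)
}

fit <- nlminb(c(0, 0), negloglik, gradient = grad_negloglik)
fit$par          # estimates of mu and log(sigma)
fit$convergence  # 0 indicates successful convergence
```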
`nlm` only uses a Newton algorithm. This can be faster than other algorithms in the sense of needing fewer iterations to reach convergence, but has its own drawbacks. It's more sensitive to the shape of the likelihood, so if it's strongly non-quadratic, it may be slower or you may get convergence to a false solution. The Newton algorithm also uses the Hessian, and computing that can be slow enough in practice that it more than cancels out any theoretical speedup.
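And a corresponding `nlm` sketch, reusing `y` and `negloglik` as defined in the `optim` example above:

```r
# Sketch: the same negative log-likelihood minimised with nlm's Newton-type
# algorithm; y and negloglik are as defined in the optim example above.
fit <- nlm(negloglik, c(0, 0), hessian = TRUE)
fit$estimate                    # parameter estimates
fit$code                        # codes 1 and 2 indicate probable convergence
sqrt(diag(solve(fit$hessian)))  # approximate standard errors, as with optim
```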