1. On p. 16 they mention that the package uses "BFGS" or "Nelder-Mead", the second one being the default option. Please take a look at these links to see their differences. The optimisation is actually performed on the likelihood function, that is, the fitted parameters in the output of vgFit
are actually the maximum likelihood estimators.
2. It is difficult to tell which optimisation method is better in general. What you can do instead is compare different methods and check whether the results coincide, using the command optim
and the command vgFit
. Below is some code for maximising the likelihood function; you can choose between six different optimisation methods.
library("VarianceGamma")
# Simulate 100 observations from a variance-gamma distribution with parameters (0,1,0,1)
data = rvg(100, vgC = 0, sigma = 1, theta = 0, nu = 1)
# -log-likelihood function using the Variance-Gamma package
ll = function(par){
  if (par[2] > 0 & par[4] > 0) return(-sum(log(dvg(data, vgC = par[1], sigma = par[2],
                                                   theta = par[3], nu = par[4]))))
  else return(Inf)
}
# Direct minimisation of the negative log-likelihood using the command optim.
# Pick one of "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN"
# ("Brent" only applies to one-dimensional problems).
optim(c(0, 1, 0, 1), ll, method = "Nelder-Mead")
# Maximisation using the command vgFit
vgFit(data)
The advantage of vgFit
is that you do not need to specify the starting value for the search. It implements three different methods for choosing one: "US", "SL" and "MoM", with "SL" being the default. I would not trust these methods blindly; I would rather compare the results from vgFit
and optim
.
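For instance, here is a rough sketch of such a comparison (assuming, as described in the package manual, that vgFit takes a startValues argument with the codes "US", "SL" and "MoM", a paramStart argument for user-supplied starting values, and returns the fitted parameters in its param component):
# Default starting values ("SL") versus method-of-moments starting values
fit_sl  = vgFit(data)
fit_mom = vgFit(data, startValues = "MoM")
# User-supplied starting values (here simply the true simulation parameters;
# order assumed to be vgC, sigma, theta, nu)
fit_us  = vgFit(data, startValues = "US", paramStart = c(0, 1, 0, 1))
# Direct minimisation of the negative log-likelihood with optim
opt = optim(c(0, 1, 0, 1), ll, method = "BFGS")
# Compare the four sets of estimates
rbind(SL = fit_sl$param, MoM = fit_mom$param, US = fit_us$param, optim = opt$par)
If the estimates agree to a reasonable number of digits, you can be fairly confident in the fit.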
3. You can check the references in the manual. For example
Seneta, E. (2004). Fitting the variance-gamma model to financial data. Journal of Applied Probability, 41A, 177–187.
Kotz, S., Kozubowski, T. J., and Podgórski, K. (2001). The Laplace Distribution and Generalizations. Birkhäuser, Boston, 349 p.
Madan, D. B. and Seneta, E. (1990). The variance gamma (V.G.) model for share market returns. Journal of Business, 63, 511–524.
The unnormalized posterior (prior times likelihood, i.e., the numerator) in your example is
\begin{equation}
f(X=2\mid \mu)\cdot P(\mu) = \left\{
\begin{array}{ll}
\frac{\mu_j^2}{2}e^{-\mu_j} \cdot P(\mu_j), & \mu=\mu_j,\ j\in\{1,2,3\} \\
0, & \mu \notin \{\mu_1,\mu_2,\mu_3\}
\end{array}
\right.
\end{equation}
the logarithm of this is
\begin{equation}
\left\{
\begin{array}{ll}
-\log(2) + 2\log(\mu_j) - \mu_j + \log(P(\mu_j)), & \mu=\mu_j,\ j\in\{1,2,3\} \\
-\infty, & \mu \notin \{\mu_1,\mu_2,\mu_3\}
\end{array}
\right.,
\end{equation}
i.e., the logarithm is $-\infty$ (undefined, if you prefer) except at these 3 points. Therefore, you cannot take the derivative with respect to $\mu$ (your error lies in thinking you can take the derivative and omit the $P(\mu)$ factor, which would correspond to using a uniform prior for $\mu$). Instead, you maximize the posterior by evaluating it, or its logarithm, at the 3 points where it is nonzero and picking the maximum. In this case, the computation using logarithms works out as follows:
\begin{equation}
\begin{array}{ccccc}
\textrm{Programmer} & \mu & \textrm{Log-likelihood} & \textrm{Log-prior} & \textrm{Sum} \\
\textrm{Mary} & 3 & -1.50 & -1.39 & \mathbf{-2.88} \\
\textrm{Tom} & 5 & -2.47 & -1.39 & -3.86 \\
\textrm{Jane} & 0.5 & -2.58 & -0.69 & -3.27 \\
\textrm{-} & \textrm{other} & -\log(2)+2\log(\mu)-\mu & -\infty & -\infty
\end{array}
\end{equation}
i.e., $\mu=3$ (Mary) is indeed the MAP estimate.
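The same computation can be written in a couple of lines of R (the priors $1/4$, $1/4$ and $1/2$ are read off the log-prior column above):
mu    = c(Mary = 3, Tom = 5, Jane = 0.5)
prior = c(Mary = 1/4, Tom = 1/4, Jane = 1/2)
# log-posterior (up to a constant) = log-likelihood of X = 2 plus log-prior
log_post = dpois(2, mu, log = TRUE) + log(prior)
round(log_post, 2)            # -2.88 -3.86 -3.27
names(which.max(log_post))    # "Mary", i.e. the MAP estimate is mu = 3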
Best Answer
1st Example
A typical case is tagging in the context of natural language processing. See here for a detailed explanation. The goal is to determine the lexical category of each word in a sentence (is it a noun, an adjective, ...). The basic idea is that you have a model of your language consisting of a hidden Markov model (HMM): the hidden states correspond to the lexical categories, and the observed states to the actual words.
The corresponding graphical model is the usual HMM chain, where $\mathbf{y} = (y_1,\dots,y_{N})$ is the sequence of words in the sentence and $\mathbf{x} = (x_1,\dots,x_{N})$ is the sequence of tags.
Once trained, the goal is to find the sequence of lexical categories that corresponds to a given input sentence. This is formulated as finding the sequence of tags that is most compatible with, i.e. most likely to have been generated by, the language model:
$$f(\mathbf{y}) = \mathbf{argmax}_{\mathbf{x} \in \mathcal{X}}\,p(\mathbf{x})\,p(\mathbf{y}\mid\mathbf{x})$$
where $\mathcal{X}$ is the set of possible tag sequences.
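As a toy illustration of that argmax (not part of the original answer: the two tags, the three-word sentence and all probabilities below are made up, and the search is done by brute force over the $2^3$ tag sequences rather than with the Viterbi algorithm):
# Hypothetical HMM with 2 tags and a 3-word sentence
tags  = c("Noun", "Verb")
words = c("dogs", "bark", "loudly")
# Made-up initial, transition and emission probabilities
init  = c(Noun = 0.6, Verb = 0.4)
trans = matrix(c(0.3, 0.7,
                 0.8, 0.2), nrow = 2, byrow = TRUE, dimnames = list(tags, tags))
emit  = matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.6, 0.3), nrow = 2, byrow = TRUE, dimnames = list(tags, words))
# p(x) * p(y | x) for a given tag sequence x
joint = function(x) {
  p = init[x[1]] * emit[x[1], words[1]]
  for (n in 2:length(words)) p = p * trans[x[n - 1], x[n]] * emit[x[n], words[n]]
  p
}
# Brute-force argmax over all tag sequences
all_x = expand.grid(rep(list(tags), length(words)), stringsAsFactors = FALSE)
all_x[which.max(apply(all_x, 1, joint)), ]
In a real tagging application the space of tag sequences is far too large for brute force, which is where dynamic programming (the Viterbi algorithm) comes in.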
2nd Example
Actually, a better example would be regression, not only because it is easier to understand, but also because it makes the difference between maximum likelihood (ML) and maximum a posteriori (MAP) estimation clear.
Basically, the problem is that of fitting a function to the observed targets $t$ with a linear combination of a set of basis functions, $$y(\mathbf{x};\mathbf{w}) = \sum_{i}w_{i}\phi_{i}(\mathbf{x})$$ where the $\phi_{i}(\mathbf{x})$ are the basis functions and $\mathbf{w}$ are the weights. It is usually assumed that the samples are corrupted by Gaussian noise. Hence, if we assume that the target function can be written exactly as such a linear combination, then we have
$$t = y(\mathbf{x};\mathbf{w}) + \epsilon$$
so that $p(t\mid\mathbf{w}) = \mathcal{N}(t\mid y(\mathbf{x};\mathbf{w}),\sigma^{2})$, with $\sigma^{2}$ the noise variance. The ML solution of this problem is equivalent to minimizing
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2}$$
which yields the well-known least-squares solution. Now, ML is sensitive to noise and, under certain circumstances, not stable. MAP allows you to pick better solutions by putting constraints on the weights. For example, a typical case is ridge regression, where you demand that the weights have a norm as small as possible,
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n}\left(t_{n} - \mathbf{w}^{T}\phi(\mathbf{x}_{n}) \right)^{2} + \lambda \sum_{k}w_{k}^{2}$$
which is equivalent to setting a Gaussian prior $\mathcal{N}(\mathbf{w}\mid\mathbf{0},\lambda^{-1}\mathbf{I})$ on the weights. All in all, the estimated weights are
$$\mathbf{w} = \mathbf{argmax}_{\mathbf{w}}\,p(\mathbf{w};\lambda)\,p(t\mid\mathbf{w};\phi)$$
Notice that in MAP the weights are not parameters as in ML, but random variables. Nevertheless, both ML and MAP are point estimators: they return an optimal set of weights rather than a full distribution over the weights.
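A minimal R sketch of this contrast, under assumptions of my own (a polynomial basis, a noisy sine target and an arbitrary $\lambda$, none of which appear in the answer above); the MAP/ridge weights use the closed-form solution $(\Phi^{T}\Phi + \lambda \mathbf{I})^{-1}\Phi^{T}\mathbf{t}$:
set.seed(1)
# Made-up data: noisy samples of sin(2*pi*x)
x = seq(0, 1, length.out = 15)
t = sin(2 * pi * x) + rnorm(length(x), sd = 0.3)
# Polynomial basis functions phi_i(x) = x^i, i = 0, ..., 5
Phi = outer(x, 0:5, `^`)
# ML / least-squares weights: minimise sum_n (t_n - w' phi(x_n))^2
w_ml = solve(crossprod(Phi), crossprod(Phi, t))
# MAP / ridge weights with Gaussian prior N(0, lambda^{-1} I); arbitrary lambda
lambda = 1e-3
w_map = solve(crossprod(Phi) + lambda * diag(ncol(Phi)), crossprod(Phi, t))
# Compare: the ridge penalty shrinks the weights towards zero
cbind(ML = w_ml, MAP = w_map)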