My questions are:
- Are generalized linear models (GLMs) guaranteed to converge to a global maximum? If so, why?
- Furthermore, what constraints must the link function satisfy to ensure convexity?
My understanding of GLMs is that they maximize a highly nonlinear likelihood function. Thus, I would imagine that there are several local maxima, and that the parameter set you converge to depends on the initial conditions of the optimization algorithm. However, after doing some research I have not found a single source indicating that there are multiple local maxima. Furthermore, I am not very familiar with optimization techniques, but I know that the Newton-Raphson method and the IRLS algorithm are prone to getting stuck in local maxima.
Please explain if possible both on an intuitive and mathematical basis!
EDIT: dksahuji answered my original question, but I want to add the follow-up question [2] above. ("What constraints must the link function satisfy to ensure convexity?")
Best Answer
The definition of exponential family is:
$$ p(x|\theta) = h(x)\exp(\theta^T\phi(x) - A(\theta)), $$
where $A(\theta)$ is the log partition function. One can prove that the following three identities hold in the 1D case (and they generalize to higher dimensions; see the properties of exponential families and of the log partition function):
$ \frac{dA}{d\theta} = \mathbb{E}[\phi(x)]$
$ \frac{d^2A}{d\theta^2} = \mathbb{E}[\phi^2(x)] -\mathbb{E}[\phi(x)]^2 = {\rm var}(\phi(x)) $
$ \frac{ \partial ^2A}{\partial\theta_i\partial\theta_j} = \mathbb{E}[\phi_i(x)\phi_j(x)] -\mathbb{E}[\phi_i(x)]\,\mathbb{E}[\phi_j(x)] = {\rm cov}(\phi(x)) \Rightarrow \nabla^2A(\theta) = {\rm cov}(\phi(x))$
These results prove that $A(\theta)$ is convex, since ${\rm cov}(\phi(x))$ is positive semidefinite. Now let us look at the likelihood function for MLE:
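These identities can be checked numerically. As an illustrative sketch (not part of the original answer), take the Bernoulli distribution in canonical form, $p(x|\theta) = \exp(\theta x - A(\theta))$ with $\phi(x) = x$ and $A(\theta) = \log(1+e^\theta)$; finite differences of $A$ should match the analytic mean and variance of $\phi(x)$:

```python
import numpy as np

# Bernoulli in canonical exponential-family form:
#   p(x|theta) = exp(theta*x - A(theta)),  phi(x) = x,
#   A(theta) = log(1 + e^theta)  (the log partition function).
A = lambda t: np.log1p(np.exp(t))

theta = 0.7                          # arbitrary canonical parameter
mu = 1.0 / (1.0 + np.exp(-theta))    # E[phi(x)] under Bernoulli(mu)
var = mu * (1.0 - mu)                # var(phi(x))

# Central finite differences of A:
h = 1e-4
dA = (A(theta + h) - A(theta - h)) / (2 * h)
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2

print(abs(dA - mu))    # dA/dtheta   matches E[phi(x)] up to O(h^2)
print(abs(d2A - var))  # d2A/dtheta2 matches var(phi(x)) >= 0, so A is convex
```

The second derivative is a variance, hence nonnegative for every $\theta$, which is the convexity of $A$ in miniature.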
\begin{align} p(\mathcal{D}|\theta) &= \bigg[\prod_{i=1}^{N}{h(x_i)}\bigg]\ \exp\!\big(\theta^T[\sum_{i=1}^{N}\phi(x_i)] - NA(\theta)\big) \\ \log\!\big(p(\mathcal{D}|\theta)\big) &= \theta^T\bigg[\sum_{i=1}^{N}\phi(x_i)\bigg] - NA(\theta) + \text{const} \\ &= \theta^T[\phi(\mathcal{D})] - NA(\theta) + \text{const} \end{align}
where the constant $\sum_{i=1}^{N}\log h(x_i)$ does not depend on $\theta$.
Now $\theta^T[\phi(\mathcal{D})]$ is linear in $\theta$ and $-NA(\theta)$ is concave, so the log-likelihood is concave: every local maximum is a global maximum. (The maximum is unique when ${\rm cov}(\phi(x))$ is positive definite, which makes $A$ strictly convex.)
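As a minimal sketch of this conclusion (the data and starting points below are made up for illustration): for a logistic-regression GLM with the canonical logit link, Newton-Raphson (equivalently IRLS) started from several different random points converges to the same maximizer, exactly as the concavity argument predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for a logistic-regression GLM (canonical logit link).
n, d = 200, 3
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def newton_fit(theta0, iters=50):
    """Newton-Raphson (equivalently IRLS) on the concave log-likelihood."""
    theta = theta0.astype(float).copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # mean response sigmoid(X theta)
        grad = X.T @ (y - p)                   # gradient of the log-likelihood
        W = p * (1.0 - p)                      # IRLS weights = var(y_i)
        H = -(X * W[:, None]).T @ X            # Hessian, negative semidefinite
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Different starting points all reach the same global maximizer.
fits = [newton_fit(rng.normal(size=d)) for _ in range(5)]
spread = max(np.linalg.norm(f - fits[0]) for f in fits)
print(spread)
```

With non-separable data like this, the spread between the fitted parameter vectors is at machine-precision level, consistent with a single global maximum rather than initialization-dependent local maxima.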
There is a generalized version called the curved exponential family, for which a similar analysis applies, but most of the proofs are given in the canonical form.