My questions are:
- Are generalized linear models (GLMs) guaranteed to converge to a global maximum? If so, why?
- Furthermore, what constraints must the link function satisfy to ensure convexity?
My understanding of GLMs is that they maximize a highly nonlinear likelihood function. Thus, I would imagine that there are several local maxima, and that the parameter set you converge to depends on the initial conditions of the optimization algorithm. However, after doing some research I have not found a single source indicating that there are multiple local maxima. Furthermore, I am not very familiar with optimization techniques, but I know that the Newton-Raphson method and the IRLS algorithm are prone to getting stuck in local maxima.
Please explain if possible both on an intuitive and mathematical basis!
EDIT: dksahuji answered my original question, but I want to add the follow-up question [2] above. ("What constraints must the link function satisfy to ensure convexity?")
Best Answer
The definition of exponential family is:
$$ p(x|\theta) = h(x)\exp(\theta^T\phi(x) - A(\theta)), $$
where $A(\theta)$ is the log partition function. One can prove that the following three identities hold in the 1D case (and they generalize to higher dimensions; see the properties of exponential families and of the log partition function):
$ \frac{dA}{d\theta} = \mathbb{E}[\phi(x)]$
$ \frac{d^2A}{d\theta^2} = \mathbb{E}[\phi^2(x)] -\mathbb{E}[\phi(x)]^2 = {\rm var}(\phi(x)) $
$ \frac{ \partial ^2A}{\partial\theta_i\partial\theta_j} = \mathbb{E}[\phi_i(x)\phi_j(x)] -\mathbb{E}[\phi_i(x)]\,\mathbb{E}[\phi_j(x)] = {\rm cov}(\phi(x)) \Rightarrow \nabla^2A(\theta) = {\rm cov}(\phi(x))$
These results prove that $A(\theta)$ is convex, since ${\rm cov}(\phi(x))$ is positive semidefinite. Now let us look at the likelihood function for MLE:
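These identities can be checked numerically. As an illustrative sketch (not part of the original answer), take the Bernoulli distribution in canonical form, $p(x|\theta) = \exp(\theta x - A(\theta))$ with $\phi(x) = x$ and $A(\theta) = \log(1+e^\theta)$; finite differences of $A$ should match the analytic mean and variance of $\phi(x)$:

```python
import numpy as np

# Bernoulli in canonical exponential-family form:
#   p(x|theta) = exp(theta*x - A(theta)),  phi(x) = x,
#   A(theta) = log(1 + e^theta)  (the log partition function).
A = lambda t: np.log1p(np.exp(t))

theta = 0.7                          # arbitrary canonical parameter
mu = 1.0 / (1.0 + np.exp(-theta))    # E[phi(x)] under Bernoulli(mu)
var = mu * (1.0 - mu)                # var(phi(x))

# Central finite differences of A:
h = 1e-4
dA = (A(theta + h) - A(theta - h)) / (2 * h)
d2A = (A(theta + h) - 2 * A(theta) + A(theta - h)) / h ** 2

print(abs(dA - mu))    # dA/dtheta   matches E[phi(x)] up to O(h^2)
print(abs(d2A - var))  # d2A/dtheta2 matches var(phi(x)) >= 0, so A is convex
```

The second derivative is a variance, hence nonnegative for every $\theta$, which is the convexity of $A$ in miniature.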
\begin{align} p(\mathcal{D}|\theta) &= \bigg[\prod_{i=1}^{N}{h(x_i)}\bigg]\ \exp\!\big(\theta^T[\sum_{i=1}^{N}\phi(x_i)] - NA(\theta)\big) \\ \log\!\big(p(\mathcal{D}|\theta)\big) &= \theta^T\bigg[\sum_{i=1}^{N}\phi(x_i)\bigg] - NA(\theta) + \text{const} \\ &= \theta^T[\phi(\mathcal{D})] - NA(\theta) + \text{const} \end{align}
where the constant $\sum_{i=1}^{N}\log h(x_i)$ does not depend on $\theta$.
Now $\theta^T[\phi(\mathcal{D})]$ is linear in $\theta$ and $-NA(\theta)$ is concave, so the log-likelihood is concave: every local maximum is a global maximum. (The maximum is unique when ${\rm cov}(\phi(x))$ is positive definite, which makes $A$ strictly convex.)
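As a minimal sketch of this conclusion (the data and starting points below are made up for illustration): for a logistic-regression GLM with the canonical logit link, Newton-Raphson (equivalently IRLS) started from several different random points converges to the same maximizer, exactly as the concavity argument predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data for a logistic-regression GLM (canonical logit link).
n, d = 200, 3
true_theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def newton_fit(theta0, iters=50):
    """Newton-Raphson (equivalently IRLS) on the concave log-likelihood."""
    theta = theta0.astype(float).copy()
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # mean response sigmoid(X theta)
        grad = X.T @ (y - p)                   # gradient of the log-likelihood
        W = p * (1.0 - p)                      # IRLS weights = var(y_i)
        H = -(X * W[:, None]).T @ X            # Hessian, negative semidefinite
        theta = theta - np.linalg.solve(H, grad)
    return theta

# Different starting points all reach the same global maximizer.
fits = [newton_fit(rng.normal(size=d)) for _ in range(5)]
spread = max(np.linalg.norm(f - fits[0]) for f in fits)
print(spread)
```

With non-separable data like this, the spread between the fitted parameter vectors is at machine-precision level, consistent with a single global maximum rather than initialization-dependent local maxima.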
There is a generalized version called the curved exponential family, for which a similar analysis applies, but most of the proofs are given in the canonical form.