I've been using the glm.fit function in R to fit parameters to a logistic regression model. By default, glm.fit uses iteratively reweighted least squares to fit the parameters. What are some reasons this algorithm would fail to converge, when used for logistic regression?
Solved – What are some reasons iteratively reweighted least squares would not converge when used for logistic regression
Tags: convergence, generalized linear model, irls, logistic, r
Related Solutions
glm() uses an iteratively re-weighted least squares algorithm. The algorithm hit the maximum number of allowed iterations before signalling convergence. The default, documented in ?glm.control, is 25. You pass control parameters as a list in the glm call:
delay.model <- glm(BigDelay ~ ArrDelay, data=flights, family=binomial,
control = list(maxit = 50))
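If it is useful, the fitted object also records whether the iterations met the convergence tolerance and how many were used, so you can check the outcome directly:

delay.model$converged   # TRUE if the IRLS iterations met the convergence tolerance
delay.model$iter        # number of IRLS (Fisher scoring) iterations actually used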
As @Conjugate Prior says, you seem to be predicting the response with the data used to generate it. You have complete separation, as any ArrDelay < 10 will predict FALSE and any ArrDelay >= 10 will predict TRUE. The other warning message tells you that the fitted probabilities for some observations were effectively 0 or 1, and that is a good indicator that something is wrong with the model.
The two warnings can go hand in hand. The likelihood function can be quite flat when some $\hat{\beta}_i$ get large, as in your example. If you allow more iterations, the model coefficients will diverge further if you have a separation issue.
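To see both warnings together, here is a minimal sketch with hypothetical toy data (not the OP's flights data), in which the response is a deterministic function of the predictor:

ArrDelay <- c(-20, -5, 0, 5, 9, 10, 12, 30, 45, 60)   # hypothetical delays
BigDelay <- ArrDelay >= 10                             # TRUE exactly when ArrDelay >= 10
sep.model <- glm(BigDelay ~ ArrDelay, family = binomial,
                 control = list(maxit = 50))
# glm() warns that fitted probabilities numerically 0 or 1 occurred (and, with a small
# maxit, that the algorithm did not converge); raising maxit only lets the coefficient
# on ArrDelay grow further instead of settling down.
coef(sep.model)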
As for your first question, one should define "standard", or acknowledge that a "canonical model" has been gradually established. As a comment indicated, it appears at least that the way you use IRWLS is rather standard.
As for your second question, "contraction mapping in probability" could be linked (however informally) to convergence of "recursive stochastic algorithms". From what I read, there is a huge literature on the subject mainly in Engineering. In Economics, we use a tiny bit of it, especially the seminal works of Lennart Ljung -the first paper was Ljung (1977)- which showed that the convergence (or not) of a recursive stochastic algorithm can be determined by the stability (or not) of a related ordinary differential equation.
(what follows has been re-worked after a fruitful discussion with the OP in the comments)
Convergence
I will use as reference Saber Elaydi, "An Introduction to Difference Equations", 2005, 3rd ed. The analysis is conditional on some given data sample, so the $x$'s are treated as fixed.
The first-order condition for the minimization of the objective function, viewed as a recursive function in $m$, $$m(k+1) = \sum_{i=1}^{N} v_i[m(k)] x_i, \;\; v_i[m(k)] \equiv \frac{w_i[m(k)]}{ \sum_{i=1}^{N} w_i[m(k)]} \qquad [1]$$
has a fixed point (the argmin of the objective function).
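For concreteness, here is a minimal sketch of recursion $[1]$ in R; the function name irls_location, the tolerance, and the iteration cap are illustrative choices:

# One implementation of recursion [1]: m(k+1) = sum_i v_i[m(k)] x_i,
# with v_i the normalised weights w_i / sum(w_i).
irls_location <- function(x, weight_fun, m0 = mean(x), tol = 1e-10, max_iter = 100) {
  m <- m0                                 # standard IRLS start: the sample mean of the x's
  for (k in seq_len(max_iter)) {
    w <- weight_fun(abs(x - m))           # unnormalised weights w_i[m(k)]
    m_new <- sum(w * x) / sum(w)          # recursion [1] with v_i = w_i / sum(w_i)
    if (abs(m_new - m) < tol) return(list(m = m_new, iterations = k))
    m <- m_new
  }
  list(m = m, iterations = max_iter)
}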
By Theorem 1.13, pp. 27-28 of Elaydi, if the first derivative with respect to $m$ of the RHS of $[1]$, evaluated at the fixed point $m^*$, denote it $A'(m^*)$, is smaller than unity in absolute value, then $m^*$ is asymptotically stable (AS). Moreover, by Theorem 4.3, p. 179, this also implies that the fixed point is uniformly AS (UAS).
"Asymptotically stable" means that for some range of values around the fixed point, a neighborhood $(m^* \pm \gamma)$, not necessarily small in size, the fixed point is attractive, and so if the algorithm gives values in this neighborhood, it will converge. The property being "uniform" means that the boundary of this neighborhood, and hence its size, is independent of the initial value of the algorithm. The fixed point becomes globally UAS if $\gamma = \infty$.
So in our case, if we prove that
$$|A'(m^*)|\equiv \left|\sum_{i=1}^{N} \frac{\partial v_i(m^*)}{\partial m}x_i\right| <1 \qquad [2]$$
we have proven the UAS property, but not global convergence. Then we can either try to establish that the neighborhood of attraction is in fact the whole real line, or that the specific starting value the OP uses, as mentioned in the comments (and which is standard in IRLS methodology), namely the sample mean of the $x$'s, $\bar x$, always belongs to the neighborhood of attraction of the fixed point.
We calculate the derivative $$\frac{\partial v_i(m^*)}{\partial m} = \frac {\frac{\partial w_i(m^*)}{\partial m}\sum_{i=1}^{N} w_i(m^*)-w_i(m^*)\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}}{\left(\sum_{i=1}^{N} w_i(m^*)\right)^2}$$
$$=\frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\frac{\partial w_i(m^*)}{\partial m}-v_i(m^*)\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right]$$ Then
$$A'(m^*) = \frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)\sum_{i=1}^{N}v_i(m^*)x_i\right]$$
$$=\frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)m^*\right]$$
and
$$|A'(m^*)| <1 \Rightarrow \left|\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}(x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [3]$$
For the derivative of the weights we have
$$\begin{align}\frac{\partial w_i(m^*)}{\partial m} = &\frac{-\rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|}|x_i-m^*|+\frac {x_i-m^*}{|x_i-m^*|}\rho'(|x_i-m^*|)}{|x_i-m^*|^2} \\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^3}\rho'(|x_i-m^*|) - \rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|^2} \\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[\frac {\rho'(|x_i-m^*|)}{|x_i-m^*|}-\rho''(|x_i-m^*|)\right]\\ \\ &=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right] \end{align}$$
Inserting this into $[3]$ we have
$$\left|\sum_{i=1}^{N}\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right](x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right|$$
$$\Rightarrow \left|\sum_{i=1}^{N}w_i(m^*)-\sum_{i=1}^{N}\rho''(|x_i-m^*|)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [4]$$
This is the condition that must be satisfied for the fixed point to be UAS. Since in our case the penalty function is convex, the sums involved are positive. So condition $[4]$ is equivalent to
$$\sum_{i=1}^{N}\rho''(|x_i-m^*|) < 2\sum_{i=1}^{N}w_i(m^*) \qquad [5]$$
If $\rho(|x_i-m|)$ is Huber's loss function, then we have a quadratic ($q$) and a linear ($l$) branch,
$$\rho(|x_i-m|)=\cases{ (1/2)|x_i- m|^2 \qquad\;\;\;\; |x_i-m|\leq \delta \\ \\ \delta\big(|x_i-m|-\delta/2\big) \qquad |x_i-m|> \delta}$$
and
$$\rho'(|x_i-m|)=\cases{ |x_i- m| \qquad |x_i-m|\leq \delta \\ \\ \delta \qquad \qquad \;\;\;\; |x_i-m|> \delta}$$
$$\rho''(|x_i-m|)=\cases{ 1\qquad |x_i-m|\leq \delta \\ \\ 0 \qquad |x_i-m|> \delta} $$
$$\cases{ w_{i,q}(m) =1\qquad \qquad \qquad |x_i-m|\leq \delta \\ \\ w_{i,l}(m) =\frac {\delta}{|x_i-m|} <1 \qquad |x_i-m|> \delta} $$
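Plugging these weights into the generic recursion sketched after $[1]$ gives the Huber location estimate; the data and the choice $\delta = 1.345$ below are purely illustrative:

# Huber weights: 1 on the quadratic branch (|x - m| <= delta),
# delta / |x - m| on the linear branch (|x - m| > delta).
huber_weights <- function(r, delta = 1.345) ifelse(r <= delta, 1, delta / r)

x <- c(2.1, 1.8, 2.4, 2.0, 1.9, 12.0)      # illustrative sample with one outlier
fit <- irls_location(x, function(r) huber_weights(r, delta = 1.345))
fit$m                                      # robust location estimate, far less attracted to 12 than mean(x)
fit$iterations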
Since we do not know how many of the $|x_i-m^*|$'s place us in the quadratic branch and how many in the linear, we decompose condition $[5]$ as ($N_q + N_l = N$)
$$\sum_{i=1}^{N_q}\rho_q''+\sum_{i=1}^{N_l}\rho_l'' < 2\left[\sum_{i=1}^{N_q}w_{i,q} +\sum_{i=1}^{N_l}w_{i,l}\right]$$
$$\Rightarrow N_q + 0 < 2\left[N_q +\sum_{i=1}^{N_l}w_{i,l}\right] \Rightarrow 0 < N_q+2\sum_{i=1}^{N_l}w_{i,l}$$
which holds. So for the Huber loss function the fixed point of the algorithm is uniformly asymptotically stable, irrespective of the $x$'s. We note that the first derivative is smaller than unity in absolute value for any $m$, not just the fixed point.
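As a quick numerical illustration of condition $[5]$ (reusing the illustrative data and $\delta$ from the sketch above), one can check that $\sum_i \rho''$ indeed stays below $2\sum_i w_i$ at the fixed point:

r    <- abs(x - fit$m)                     # residuals |x_i - m*|
rho2 <- ifelse(r <= 1.345, 1, 0)           # rho''(|x_i - m*|) for the Huber loss
w    <- huber_weights(r, delta = 1.345)    # weights w_i(m*)
sum(rho2) < 2 * sum(w)                     # condition [5]: TRUE for the Huber loss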
What we should do now is either prove that the UAS property is also global, or that, if $m(0) = \bar x$ then $m(0)$ belongs to the neighborhood of attraction of $m^*$.
Related Question
- Solved – Logistic regression: Fisher’s scoring iterations do not match the selected iterations in glm
- Solved – How to correctly implement iteratively reweighted least squares algorithm for multiple logistic regression
- Solved – Iteratively Reweighted Least squares for logistic regression when features are dependent
- Solved – Iteratively Reweighted Least Squares, (Logistic Regression)
Best Answer
If the two classes are separable, iteratively reweighted least squares (IRLS) will break down. In such a scenario, any hyperplane that separates the two classes is a solution, and there are infinitely many of them. IRLS is meant to find a maximum likelihood solution, and maximum likelihood has no mechanism to favor any of these solutions over the others (e.g. no concept of maximum margin). Depending on the initialization, IRLS will head toward one of these solutions and break due to numerical problems (I don't know the details of IRLS; this is an educated guess).
Another problem arises when the training data are linearly separable. Each of the hyperplane solutions corresponds to a Heaviside function, so all the fitted probabilities are either 0 or 1; the logistic regression solution becomes a hard classifier rather than a probabilistic one.
To express this in mathematical notation, the Heaviside function is the limit of the sigmoid, $\lim_{\|\mathbf{w}\| \rightarrow \infty}\sigma(\mathbf{w}^T x + b)$, where $\sigma$ is the sigmoid function and $(\mathbf{w}, b)$ determines the hyperplane solution. So in theory IRLS never stops: it moves toward a $\mathbf{w}$ of ever-increasing magnitude, and in practice it breaks down due to numerical problems.
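To see this divergence concretely, one can refit a separated toy problem with increasing iteration caps and watch the slope grow; the data below are hypothetical:

# Hypothetical separated data: the response is a deterministic function of the predictor.
x <- c(-2, -1, -0.5, 0.5, 1, 2)
y <- x > 0
for (iters in c(5, 10, 25, 50)) {
  fit <- suppressWarnings(glm(y ~ x, family = binomial,
                              control = list(maxit = iters)))
  cat("maxit =", iters, " slope =", coef(fit)["x"], "\n")
}
# The slope keeps increasing with every extra block of iterations instead of converging.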