Solved – Choosing IRLS over gradient descent in logistic regression

irls · least-squares · logistic

I am currently reading Bishop [1] and am confused about why we should use IRLS (Iteratively Reweighted Least Squares), since it seems that gradient descent, which needs only first derivatives, would solve the problem. What is the point of introducing the Hessian matrix? Or have I misunderstood the algorithm somewhere?

  1. Could it be faster? Why?
  2. What does "re-weighted" mean here? Doesn't gradient descent also update its weights iteratively, so that they too are "re-weighted"?
  3. The method IRLS uses is Newton-Raphson, which gives exactly the same result as the standard least-squares solution in the linear regression model, as shown below.

$$ w_\text{new} = w_\text{old} - H^{-1}\nabla E(w)
$$

$$ \nabla E(w) = \sum_{n=1}^{N}\left( w^{T}\phi_{n}-t_{n} \right)\phi_{n} = \Phi^{T}\Phi w - \Phi^{T}t
$$

$$ \nabla\nabla E(w) = \sum_{n=1}^{N}\phi_{n}\phi_{n}^{T} = \Phi^{T}\Phi
$$

$$ w_\text{new} = w_\text{old} - \left( \Phi^{T}\Phi \right)^{-1}\left\{ \Phi^{T}\Phi\, w_\text{old} - \Phi^{T}t \right\} = \left( \Phi^{T}\Phi \right)^{-1}\Phi^{T}t
$$

So I wonder: when this method is applied to linear regression, does the algorithm require only a single iteration?
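The algebra above can be checked numerically. The following NumPy sketch (the data and starting point are made up for illustration) takes one Newton-Raphson step on the sum-of-squares error from an arbitrary $w_\text{old}$ and compares it with the closed-form least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))                  # design matrix (rows are phi_n^T)
t = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

w_old = rng.normal(size=3)                       # arbitrary starting point
grad = Phi.T @ Phi @ w_old - Phi.T @ t           # gradient of E(w) at w_old
H = Phi.T @ Phi                                  # Hessian of E(w)

w_new = w_old - np.linalg.solve(H, grad)         # one Newton-Raphson step
w_ls = np.linalg.lstsq(Phi, t, rcond=None)[0]    # standard least-squares solution

print(np.allclose(w_new, w_ls))                  # True: one step lands on the LS solution
```

Whatever the starting point, the step cancels $w_\text{old}$ exactly, which is the single-iteration behaviour asked about above.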

[1]: Bishop, Christopher (2006),
Pattern Recognition and Machine Learning,
Springer-Verlag New York

Best Answer

  1. Yes, IRLS could be faster, as I said in my answer to your previous question. For example, if the log-likelihood is nearly quadratic (which it usually will be if you can start fairly close to the maximum and the sample size isn't very small), then it may converge in only a couple of steps. Note that, in fact, on p. 240 Bishop says

    Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008)

  2. Notice the "LS" part. IRLS works by solving a weighted least-squares problem at each step, but the weights attached to the observations are recomputed at every iteration (hence "re-weighted").

  3. The particular version of IRLS Bishop presents may be Newton-Raphson, but this is not necessarily the case in general (it could, for example, be Fisher scoring, which is related to but slightly different from Newton-Raphson). But yes, a single step of IRLS suffices on an (already correctly weighted) linear regression problem.
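To make points 2 and 3 concrete, here is a minimal sketch of IRLS for logistic regression in the form Bishop gives (weights $R = y(1-y)$ and working response $z = \Phi w - R^{-1}(y - t)$); the data-generating code and iteration count are my own illustrative choices, and a production implementation would need safeguards for separable data, where the weights underflow:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Fit logistic regression by IRLS: each iteration is one
    weighted least-squares solve with freshly recomputed weights."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y)                 # per-observation weights, re-computed ("re-weighted")
        z = Phi @ w - (y - t) / R         # working response
        # weighted least squares: w = (Phi^T R Phi)^{-1} Phi^T R z
        A = Phi.T @ (R[:, None] * Phi)
        b = Phi.T @ (R * z)
        w = np.linalg.solve(A, b)
    return w

# synthetic illustration (assumed data, not from Bishop)
rng = np.random.default_rng(1)
Phi = np.column_stack([np.ones(200), rng.normal(size=200)])
p = sigmoid(Phi @ np.array([-0.5, 1.5]))
t = (rng.uniform(size=200) < p).astype(float)

w_hat = irls_logistic(Phi, t)
grad = Phi.T @ (sigmoid(Phi @ w_hat) - t)   # log-likelihood gradient, ~0 at the MLE
```

Because the log-likelihood here is concave and nearly quadratic near the maximum, a handful of these weighted least-squares solves drives the gradient to machine precision, whereas plain gradient descent would typically need far more iterations.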