Solved – Iteratively Reweighted Least Squares for logistic regression when features are dependent

Tags: irls, logistic, optimization, regression

I was solving logistic regression using IRLS (wiki), as described in the linked article. Now I have a doubt: if $X$ has linearly dependent features, then $X^T S_k X$ will not have full rank and thus will not be invertible. In that case, would solving

$$\operatorname{argmin}_w \left\| S_k^{1/2}Xw - \left(S_k^{1/2}Xw_k + S_k^{-1/2}(y-\mu_k)\right) \right\|$$

which is the least-squares formulation of the update equation, be the right thing to do? By "right" I mean that the update moves in a descent direction. Also, if not, how should we handle such cases?
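For reference (using the same notation as the wiki article, with $\mu_k$ the current fitted means and $S_k$ the diagonal weight matrix), setting the gradient of the objective above to zero gives the normal-equations form of the update, which is exactly where the invertibility of $X^T S_k X$ enters:

$$w_{k+1} = \left(X^T S_k X\right)^{-1} X^T S_k z_k, \qquad z_k = X w_k + S_k^{-1}\left(y - \mu_k\right).$$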

Best Answer

This might come as a slight anti-climax, but essentially, prior to fitting, we remove any dependent columns of $X$ (or $XW$ in the case of a weighted task) so that $X$ is of full rank. For each IRLS iteration we compute a new vector of coefficient estimates, so computationally we just care that this (potentially weighted) fit is numerically stable. The "iteration" part is somewhat immaterial.

The actual "rank correction" is relatively straightforward using $QR$ decomposition with column pivoting. This is a decomposition such that $X = QRP^T$; $Q$ is a unitary matrix, $R$ is an upper triangular matrix and $P$ is the permutation matrix such that diagonal elements of $R$ are non-increasing. If $QR$ suggests that the matrix $X$ (or $XW$ if we have weights) is not of full rank $k_f$ but of rank say $k_s$ (where $k_s < k_f$) we keep $k_s$ columns from the design matrix $X$. To pick the columns to retain we use $P$ and get the columns of $X$ denoted by the first $k_s$ columns of $P$. And then we are done - this "thinner" $X$ should be full rank so the show is over. :) Both R and MATLAB employ QR with pivoting in their glm.fit and glmfit functions respectively to handle this kind of rank-deficiencies. MATLAB is slightly nicer and removes these "offending columns" prior to the IRLS iteration while R explicitly sets certain factors to NA's after the IRLS iteration but the idea and approach is the same.

Addition based on comment: In general there is a multitude of approaches for fitting logistic regression models; T. Minka has written a nice overview here, which provides a number of Big-O results. As for IRLS in particular, IRLS turns out to be equivalent to Newton's method; the catch is that we use the expected Hessian of the Bernoulli likelihood (i.e. the Fisher information matrix) instead of the actual Hessian, which leads to the name Fisher scoring. This suggests quadratic convergence with $O(nd^2)$ time complexity per iteration. It goes without saying that this does not guarantee convergence; for example, successive iterations might lead to an ever-increasing solution $\beta$ in cases where the likelihood does not have a finite maximum. Cases of complete separation are prime examples, and CV has two very enlightening posts on the matter here and here to get you started. The course notes by C. Shalizi on Logistic Regression and Newton's Method are also quite good regarding the numerical aspects of this (the recommended reading in the notes, Faraway's excellent Extending the Linear Model with R, Chapter 2, unfortunately has no detailed numerics on this). About 5 years ago I read parts of Å. Björck's Numerical Methods for Least Squares Problems - it is not a Stats book, but if you care about the Numerical Linear Algebra (i.e. speed and stability) it is a good bet.
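To make the Fisher-scoring view concrete, here is a minimal IRLS sketch for logistic regression in Python/NumPy (assumptions of mine: the function name and tolerance are arbitrary choices, $X$ is taken to be of full rank, and there are no safeguards against separation - exactly the situations discussed above):

```python
# Minimal IRLS / Fisher scoring sketch for logistic regression.
# Assumes X is full rank; no rank correction or separation safeguards.
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ w
        mu = 1.0 / (1.0 + np.exp(-eta))   # Bernoulli means mu_k
        s = mu * (1.0 - mu)               # diagonal of the weight matrix S_k
        z = eta + (y - mu) / s            # working response z_k
        # Weighted least-squares step: solve (X^T S_k X) w = X^T S_k z_k.
        # Each iteration costs O(n d^2), matching the complexity quoted above.
        XtS = X.T * s                     # X^T S_k without forming S_k explicitly
        w_new = np.linalg.solve(XtS @ X, XtS @ z)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Example on synthetic data (no complete separation here, so it converges).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([-0.5, 1.0, 2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
print(irls_logistic(X, y))                # roughly recovers beta_true
```

If $X$ had dependent columns, the solve with $X^T S_k X$ above would fail (or be numerically meaningless), which is where the rank correction from the first part of the answer comes in.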

For something somewhat more formal than Wikipedia but still accessible, I would suggest looking at M. Mueller's Generalized Linear Models contributed chapter for Springer's Handbook of Computational Statistics. You want to focus on Sect. 3.3 (or 24.3.3 if you can afford Springer's Handbook). CV itself already has some excellent posts on the $QR$ decomposition and its use in linear regression problems; see here and here. I also think you will find this thread on "How to correctly implement iteratively reweighted least squares algorithm for multiple logistic regression?" quite helpful.