As for your first question, one would have to define "standard", or acknowledge that a "canonical model" has gradually been established. As a comment indicated, it appears that, at least, the way you use IRWLS is rather standard.
As for your second question, "contraction mapping in probability" could be linked (however informally) to convergence of "recursive stochastic algorithms". From what I read, there is a huge literature on the subject, mainly in Engineering. In Economics we use a tiny bit of it, especially the seminal work of Lennart Ljung (the first paper was Ljung 1977), which showed that the convergence (or not) of a recursive stochastic algorithm can be determined by the stability (or not) of a related ordinary differential equation.
(what follows has been re-worked after a fruitful discussion with the OP in the comments)
Convergence
I will use as reference Saber Elaydi, "An Introduction to Difference Equations", 2005, 3rd ed.
The analysis is conditional on some given data sample, so the $x$'s are treated as fixed.
The first-order condition for the minimization of the objective function, written as a recursion in $m$,
$$m(k+1) = \sum_{i=1}^{N} v_i[m(k)] x_i, \;\; v_i[m(k)] \equiv \frac{w_i[m(k)]}{ \sum_{i=1}^{N} w_i[m(k)]} \qquad [1]$$
has a fixed point (the argmin of the objective function).
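To make the recursion concrete, here is a minimal sketch of iteration $[1]$ for a generic weight function; the names (`irls_location`, `weight_fn`) are mine, purely for illustration, and the starting value is whatever the user supplies (e.g. the sample mean).

```python
import numpy as np

def irls_location(x, weight_fn, m0, tol=1e-10, max_iter=500):
    """Iterate m(k+1) = sum_i v_i[m(k)] * x_i, i.e. equation [1].

    weight_fn(u) returns the weight w_i = rho'(u)/u evaluated at u = |x_i - m|;
    m0 is the starting value of the recursion.
    """
    m = m0
    for _ in range(max_iter):
        w = weight_fn(np.abs(x - m))   # w_i[m(k)]
        v = w / w.sum()                # normalized weights v_i[m(k)]
        m_new = np.dot(v, x)           # weighted average = m(k+1)
        if abs(m_new - m) < tol:       # stop when the iterates settle
            return m_new
        m = m_new
    return m
```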
By Theorem 1.13, pp. 27-28 of Elaydi, if the first derivative with respect to $m$ of the RHS of $[1]$, evaluated at the fixed point $m^*$ (denote it $A'(m^*)$), is smaller than unity in absolute value, then $m^*$ is asymptotically stable (AS). Moreover, by Theorem 4.3, p. 179, this also implies that the fixed point is uniformly AS (UAS).
"Asymptotically stable" means that for some range of values around the fixed point, a neighborhood $(m^* \pm \gamma)$, not necessarily small in size, the fixed point is attractive , and so if the algorithm gives values in this neighborhood, it will converge. The property being "uniform", means that the boundary of this neighborhood, and hence its size, is independent of the initial value of the algorithm. The fixed point becomes globally UAS, if $\gamma = \infty$.
So in our case, if we prove that
$$|A'(m^*)|\equiv \left|\sum_{i=1}^{N} \frac{\partial v_i(m^*)}{\partial m}x_i\right| <1 \qquad [2]$$
we have proven the UAS property, but without global convergence. Then we can either try to establish that the neighborhood of attraction is in fact the whole extended real line, or that the specific starting value the OP uses, as mentioned in the comments (and which is standard in IRLS methodology), namely the sample mean of the $x$'s, $\bar x$, always belongs to the neighborhood of attraction of the fixed point.
We calculate the derivative
$$\frac{\partial v_i(m^*)}{\partial m} = \frac {\frac{\partial w_i(m^*)}{\partial m}\sum_{i=1}^{N} w_i(m^*)-w_i(m^*)\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}}{\left(\sum_{i=1}^{N} w_i(m^*)\right)^2}$$
$$=\frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\frac{\partial w_i(m^*)}{\partial m}-v_i(m^*)\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right]$$
Then
$$A'(m^*) = \frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)\sum_{i=1}^{N}v_i(m^*)x_i\right]$$
$$=\frac 1{\sum_{i=1}^{N} w_i(m^*)}\cdot\left[\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}x_i-\left(\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}\right)m^*\right]$$
and
$$|A'(m^*)| <1 \Rightarrow \left|\sum_{i=1}^{N}\frac{\partial w_i(m^*)}{\partial m}(x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [3]$$
For the weights we have
$$\begin{align}\frac{\partial w_i(m^*)}{\partial m} &= \frac{-\rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|}|x_i-m^*|+\frac {x_i-m^*}{|x_i-m^*|}\rho'(|x_i-m^*|)}{|x_i-m^*|^2} \\
\\
&=\frac {x_i-m^*}{|x_i-m^*|^3}\rho'(|x_i-m^*|) - \rho''(|x_i-m^*|)\cdot \frac {x_i-m^*}{|x_i-m^*|^2} \\
\\
&=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[\frac {\rho'(|x_i-m^*|)}{|x_i-m^*|}-\rho''(|x_i-m^*|)\right]\\
\\
&=\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right]
\end{align}$$
Inserting this into $[3]$ we have
$$\left|\sum_{i=1}^{N}\frac {x_i-m^*}{|x_i-m^*|^2}\cdot \left[w_i(m^*)-\rho''(|x_i-m^*|)\right](x_i-m^*)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right|$$
$$\Rightarrow \left|\sum_{i=1}^{N}w_i(m^*)-\sum_{i=1}^{N}\rho''(|x_i-m^*|)\right| < \left|\sum_{i=1}^{N} w_i(m^*)\right| \qquad [4]$$
This is the condition that must be satisfied for the fixed point to be UAS. Since in our case the penalty function is convex, the sums involved are positive. So condition $[4]$ is equivalent to
$$\sum_{i=1}^{N}\rho''(|x_i-m^*|) < 2\sum_{i=1}^{N}w_i(m^*) \qquad [5]$$
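As a sanity check, condition $[5]$ is easy to evaluate numerically once $\rho''$ and the weight function are available as code; the snippet below is only an illustrative sketch (the names `rho2_fn` and `weight_fn` are mine, not part of the derivation).

```python
import numpy as np

def stability_condition(x, m, rho2_fn, weight_fn):
    """Evaluate condition [5]: sum_i rho''(|x_i - m|) < 2 * sum_i w_i(m)."""
    u = np.abs(x - m)
    lhs = np.sum(rho2_fn(u))          # sum of second derivatives of the loss
    rhs = 2.0 * np.sum(weight_fn(u))  # twice the sum of the IRLS weights
    return lhs < rhs
```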
If $\rho(|x_i-m|)$ is Huber's loss function, then we have a quadratic ($q$) and a linear ($l$) branch,
$$\rho(|x_i-m|)=\cases{ (1/2)|x_i- m|^2 \qquad\;\;\;\; |x_i-m|\leq \delta \\
\\
\delta\big(|x_i-m|-\delta/2\big) \qquad |x_i-m|> \delta}$$
and
$$\rho'(|x_i-m|)=\cases{ |x_i- m| \qquad |x_i-m|\leq \delta \\
\\
\delta \qquad \qquad \;\;\;\; |x_i-m|> \delta}$$
$$\rho''(|x_i-m|)=\cases{ 1\qquad |x_i-m|\leq \delta \\
\\
0 \qquad |x_i-m|> \delta} $$
$$\cases{ w_{i,q}(m) =1\qquad \qquad \qquad |x_i-m|\leq \delta \\
\\
w_{i,l}(m) =\frac {\delta}{|x_i-m|} <1 \qquad |x_i-m|> \delta} $$
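For reference, the Huber branches above translate directly into code; this is a small sketch where `delta` plays the role of $\delta$ and the function names are my own.

```python
import numpy as np

def huber_weight(u, delta):
    """IRLS weight w_i = rho'(u)/u for the Huber loss:
    1 on the quadratic branch (u <= delta), delta/u on the linear branch."""
    u = np.asarray(u, dtype=float)
    safe_u = np.maximum(u, np.finfo(float).tiny)   # avoid division by zero at u = 0
    return np.where(u <= delta, 1.0, delta / safe_u)

def huber_rho2(u, delta):
    """Second derivative rho''(u): 1 on the quadratic branch, 0 on the linear branch."""
    u = np.asarray(u, dtype=float)
    return np.where(u <= delta, 1.0, 0.0)
```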
Since we do not know how many of the $|x_i-m^*|$'s place us in the quadratic branch and how many in the linear, we decompose condition $[5]$ as ($N_q + N_l = N$)
$$\sum_{i=1}^{N_q}\rho_q''+\sum_{i=1}^{N_l}\rho_l'' < 2\left[\sum_{i=1}^{N_q}w_{i,q} +\sum_{i=1}^{N_l}w_{i,l}\right]$$
$$\Rightarrow N_q + 0 < 2\left[N_q +\sum_{i=1}^{N_l}w_{i,l}\right] \Rightarrow 0 < N_q+2\sum_{i=1}^{N_l}w_{i,l}$$
which holds. So for the Huber loss function the fixed point of the algorithm is uniformly asymptotically stable, irrespective of the $x$'s. Note that the first derivative is smaller than unity in absolute value for any $m$, not just at the fixed point.
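Putting the pieces together, a quick numerical experiment (a sketch only, assuming the illustrative helpers sketched above, `irls_location`, `huber_weight`, `huber_rho2` and `stability_condition`, are in scope) shows both the convergence from $m(0)=\bar x$ and the fact that the map in $[1]$ has derivative smaller than one in absolute value:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])  # sample with outliers
delta = 1.345                                      # a common Huber tuning constant

weight_fn = lambda u: huber_weight(u, delta)
rho2_fn = lambda u: huber_rho2(u, delta)

m_star = irls_location(x, weight_fn, m0=x.mean())  # start from the sample mean
print("fixed point:", m_star)
print("condition [5] holds:", stability_condition(x, m_star, rho2_fn, weight_fn))

# Finite-difference check that |A'(m*)| < 1, where A(m) = sum_i v_i[m] x_i.
def A(m):
    w = weight_fn(np.abs(x - m))
    return np.dot(w / w.sum(), x)

h = 1e-6
A_prime = (A(m_star + h) - A(m_star - h)) / (2 * h)
print("|A'(m*)| =", abs(A_prime))                  # should be < 1
```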
What we should do now is either prove that the UAS property is also global, or prove that if $m(0) = \bar x$, then $m(0)$ belongs to the neighborhood of attraction of $m^*$.
Best Answer
This might come as a slight anti-climax, but essentially, prior to fitting we remove any linearly dependent columns of $X$ (or $XW$ in the case of a weighted task) so that $X$ has full rank. At each IRLS iteration we compute a new vector of coefficient estimates, so computationally we only care that this (potentially weighted) fit is numerically stable; the "iteration" part is somewhat immaterial.
The actual "rank correction" is relatively straightforward using $QR$ decomposition with column pivoting. This is a decomposition such that $X = QRP^T$; $Q$ is a unitary matrix, $R$ is an upper triangular matrix and $P$ is the permutation matrix such that diagonal elements of $R$ are non-increasing. If $QR$ suggests that the matrix $X$ (or $XW$ if we have weights) is not of full rank $k_f$ but of rank say $k_s$ (where $k_s < k_f$) we keep $k_s$ columns from the design matrix $X$. To pick the columns to retain we use $P$ and get the columns of $X$ denoted by the first $k_s$ columns of $P$. And then we are done - this "thinner" $X$ should be full rank so the show is over. :) Both R and MATLAB employ QR with pivoting in their
glm.fit
andglmfit
functions respectively to handle this kind of rank-deficiencies. MATLAB is slightly nicer and removes these "offending columns" prior to the IRLS iteration while R explicitly sets certain factors to NA's after the IRLS iteration but the idea and approach is the same.Addition based on comment: In general there is a multitude of approaches for fitting logistic regression models; T. Minka has written a nice overview here, it provides a number of Big O results. Particularly to IRLS, IRLS turns out to be equivalent to the use of Newton's method; the catch is that we use the expected Hessian of the Bernoulli likelihood (ie. the Fisher information matrix) instead of the actual Hessian; this leads to name of Fisher scoring method. This suggest a quadratic convergence with $O(nd^2)$ time complexity per iteriation. It goes without saying this does not guarantee convergence; for example successive iteration might lead to ever increasing MLE solution $\beta$ in cases where the likelihood does not have finite maximum. Use-cases of complete separation are prime examples and CV has two very enlightening post on the matter here and here to get you started. The course notes by C. Shalizi on Logistic Regression and Newton’s Method also quite good regarding the numeric aspects of this (the recommended reading in the notes (Faraway's awesome Extending the Linear Model with R, Chapter 2), has no detailed numerics on this unfortunately). About 5 years ago I read parts of Å. Björck's Numerical Methods for Least Squares Problems - it is not a Stats book but if you care about the Numerical Linear Algebra (ie. speed and stability) it is the good bet.
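To make the two ingredients above concrete (drop rank-deficient columns via pivoted $QR$, then run the IRLS/Fisher-scoring loop), here is a rough sketch in Python/NumPy rather than R or MATLAB; it mimics the idea, not the exact internals of glm.fit or glmfit, and the tolerance choice is purely illustrative.

```python
import numpy as np
from scipy.linalg import qr, solve

def trim_rank_deficient(X, tol=1e-10):
    """Keep only the columns picked out by QR with column pivoting.

    The numerical rank k_s is estimated from the diagonal of R; the first
    k_s entries of the pivot vector indicate which columns of X to retain.
    """
    Q, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    k_s = int(np.sum(diag > tol * diag[0]))   # estimated numerical rank
    keep = np.sort(piv[:k_s])                 # columns of X to keep
    return X[:, keep], keep

def irls_logistic(X, y, max_iter=25, tol=1e-8):
    """Fisher scoring / IRLS for logistic regression on a full-rank design."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))             # fitted probabilities
        W = np.clip(mu * (1.0 - mu), 1e-10, None)   # IRLS weights (expected Hessian diag)
        z = eta + (y - mu) / W                      # working response
        XtW = X.T * W
        beta_new = solve(XtW @ X, XtW @ z)          # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On a rank-deficient design one would then call trim_rank_deficient first and fit on the thinned matrix, reporting the coefficients of the dropped columns as missing, in the spirit of R's NA convention.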
For something somewhat more formal than Wikipedia but still accessible I would suggest looking at M. Mueller's Generalized Linear Models contributed chapter for Springer's Handbook of Computational Statistics. You want to focus on Sect. 3.3 (or 24.3.3 if you can afford Springer's Handbook). CV itself already has some excellent posts on the $QR$ decomposition and its use in linear regression problems, see here and here. I also think you will find this thread on "How to correctly implement iteratively reweighted least squares algorithm for multiple logistic regression?" quite helpful.