Solved – How to correctly implement iteratively reweighted least squares algorithm for multiple logistic regression

irls, logistic, multiple regression

I'm confused about the iteratively reweighted least squares algorithm used to solve for logistic regression coefficients as described on page 121 of The Elements of Statistical Learning, 2nd Edition (Hastie, Tibshirani, Friedman 2009).

The final step of the process, after fitting a Taylor approximation to the log-likelihood of $N$ observations, is to solve the following weighted least squares problem:

$\beta^{new}\leftarrow \text{argmin}_{\beta}\,(\textbf{z}-\textbf{X}\beta)^T\textbf{W}(\textbf{z}-\textbf{X}\beta)$ $(1)$

by computing $\frac{\partial[(\textbf{z}-\textbf{X}\beta)^T\textbf{W}(\textbf{z}-\textbf{X}\beta)]}{\partial\beta_j}$, setting it equal to zero, and then solving for $\beta_j^{new}$,

where:

$\textbf{z}=\textbf{X}\beta^{old}+\textbf{W}^{-1}(\textbf{y}-\textbf{p})$,

$\textbf{W}=N\times{}N$ diagonal matrix of weights with $i$th diagonal element $p(x_i;\beta^{old})(1-p(x_i;\beta^{old}))$,

$\textbf{p}=$vector of fitted probabilities with $i$th element $p(x_i;\beta^{old})$,

$\textbf{y}=$vector of $y_i$ values,

$\textbf{X}=$matrix of $x_i$ values,

$\beta=$vector of coefficients $\beta_0,\beta_1,…,\beta_p$.

On the right-hand side of expression (1), the $\beta$s are missing any superscript. Is $\beta$ presumed to be equal to $\beta^{old}$? That is, in order to solve for $\beta_j^{new}$ in (1), do we plug in the most recent updates of $\beta_l$ for all $l\neq j$, calculated in prior steps?
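For concreteness, here is a minimal numpy sketch (names like `X`, `y`, and `beta_old` are my own placeholders) of how I am assembling these quantities from $\beta^{old}$:

```python
import numpy as np

def irls_quantities(X, y, beta_old):
    """Build p, W, z from the current coefficients beta_old.

    X is the N x (p+1) matrix of x_i values (with a leading column of ones),
    y is the N-vector of 0/1 responses.
    """
    # Fitted probabilities p_i = p(x_i; beta_old) from the logistic function
    p = 1.0 / (1.0 + np.exp(-X @ beta_old))
    # W is diagonal with ith entry p_i (1 - p_i)
    W = np.diag(p * (1.0 - p))
    # Working response z = X beta_old + W^{-1} (y - p); since W is diagonal,
    # W^{-1}(y - p) is just elementwise division
    z = X @ beta_old + (y - p) / (p * (1.0 - p))
    return p, W, z
```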

Best Answer

In an expression like

$$ \beta^{new}\leftarrow \text{argmin}_{b}(\textbf{z}-\textbf{X}b)^T\textbf{W}(\textbf{z}-\textbf{X}b) $$

the point is that the output, $\beta^{new}$, is the result of considering all possible $b\in \mathbb{R}^p$ or whatever other space you are optimizing over. That's why there's no superscript: in the optimization problem $\beta$ is a dummy variable, just like with an integral (and I'm deliberately writing $b$ not $\beta$ to reflect $b$ being a dummy variable, not the target parameter).

The overall procedure involves taking a $\beta^{(t)}$, computing the working "response" $\textbf{z}$ for the WLS, and then solving the WLS problem for $\beta^{(t+1)}$; as you know, we can use derivatives to get a nice closed-form solution for the optimal $\hat \beta$ of this problem. Thus $\beta^{old}$, which is fixed, appears in the vector $\textbf{z}$ in the WLS computation and leads to $\beta^{new}$. That's the "iteration" part: we use our current solution to create a new response vector; the WLS part is then solving for the new $\hat \beta$. We keep doing this until there's no "significant" change.
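To make that concrete, here is a minimal sketch of the whole iteration in numpy (not code from the book; the zero starting value, tolerance, and iteration cap are arbitrary choices):

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=100):
    """Fit logistic regression coefficients by IRLS.

    X is N x (p+1) (including an intercept column), y is a 0/1 vector.
    """
    beta = np.zeros(X.shape[1])                 # beta^{(0)}
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities under beta^{old}
        w = p * (1.0 - p)                       # diagonal of W
        z = X @ beta + (y - p) / w              # working response z
        # WLS step: solve the normal equations (X^T W X) beta_new = X^T W z
        XtW = X.T * w                           # equals X.T @ diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:   # no "significant" change
            return beta_new
        beta = beta_new
    return beta
```

Each pass uses only the fixed $\beta^{old}$ (here `beta`) to build $\textbf{z}$ and $W$; the WLS step then produces $\beta^{new}$ over all candidate coefficient vectors at once, not coordinate by coordinate.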

Remember that the WLS procedure doesn't know that it is being used iteratively; as far as it is concerned, it is presented with an $X$, $y$, and $W$ and then outputs

$$ \hat{\beta} = (X^T W X)^{-1} X^T W y $$

just as it would in any other instance. We are simply being clever with our choice of response (the working response $\textbf{z}$ plays the role of $y$) and weights $W$, and iterating.
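To emphasize that, a WLS routine can be written as a black box over whatever inputs it is handed; a sketch (with the weights passed as the diagonal of $W$):

```python
import numpy as np

def wls(X, y, w):
    """argmin_b (y - Xb)^T W (y - Xb) with W = diag(w).

    The routine has no idea whether y is a "real" response or the working
    response z built from some beta_old; it just solves the normal
    equations X^T W X b = X^T W y for whatever X, y, w it is given.
    """
    XtW = X.T * w                    # equals X.T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)
```

(Solving the normal equations with a linear solver rather than explicitly forming $(X^TWX)^{-1}$ is only a numerical preference; the formula is the same.)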

Update: We can derive the solution to the WLS problem without using any component-wise derivatives. Note that if $Y \sim \mathcal N(X\beta, I)$ then $W^{1/2}Y \sim \mathcal N(W^{1/2}X\beta, W)$, so the weighted criterion $(Y-X\beta)^TW(Y-X\beta)$ is just the ordinary least squares criterion in the transformed variables $W^{1/2}Y$ and $W^{1/2}X$. Differentiating it with respect to $\beta$ gives

$$ \frac{\text d}{\text d\beta}\|W^{1/2}Y - W^{1/2}X\beta\|^2 = -2X^TWY + 2X^TWX\beta. $$

Setting the derivative equal to zero and solving, we obtain

$$ \hat{\beta} = (X^TWX)^{-1} X^TWY. $$

Thus for any inputs $W$, $X$, and $Y$ (provided $W$ is positive definite and $X$ has full column rank) we get our optimal $\hat{\beta}$. It doesn't matter what these inputs are. So we use our $\beta^{old}$ to create our $Y$ vector (the working response $\textbf{z}$) and the weights, and then we plug those into this formula, which outputs the optimal $\hat \beta$ for the given inputs. The whole point of the WLS procedure is to solve for $\hat \beta$; it in and of itself doesn't require plugging in a $\hat \beta$.
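As a quick sanity check of the derivation above, on arbitrary made-up inputs, the gradient $-2X^TWY + 2X^TWX\hat\beta$ should vanish at the $\hat\beta$ the formula returns:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # arbitrary full-rank design
Y = rng.normal(size=N)                                       # arbitrary response
w = rng.uniform(0.1, 1.0, size=N)                            # positive weights, so W = diag(w) is positive definite

XtW = X.T * w                                   # X^T W
beta_hat = np.linalg.solve(XtW @ X, XtW @ Y)    # (X^T W X)^{-1} X^T W Y

grad = -2 * (XtW @ Y) + 2 * (XtW @ X @ beta_hat)
print(np.allclose(grad, 0.0))                   # expected: True, the gradient is ~0 at the minimizer
```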