Regression – Simple Intuitive Explanation of IRLS Method to Find MLE of a GLM

generalized-linear-model, irls, link-function, maximum-likelihood, regression

Background:

I'm trying to follow Princeton's review of MLE estimation for GLM.

I understand the basics of MLE estimation: likelihood, score, observed and expected Fisher information and the Fisher scoring technique. And I know how to justify simple linear regression with MLE estimation.


The question:

I can't understand even the first line of this method 🙁

What's the intuition behind the $z_i$ working variables defined as:

$$ z_i = \hat\eta_i + (y_i -\hat\mu_i)\frac{d\eta_i}{d\mu_i}$$

Why are they used instead of $y_i$ to estimate $\beta$?

And what is their relation to the response/link function, which is the connection between $\eta$ and $\mu$?

If anyone has a simple explanation or can direct me to a more basic-level text about this I would be grateful.

Best Answer

Some years ago I wrote a paper about this for my students (in Spanish), so I will try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $f_0$ be a density function centered at zero in some sense. We can construct a family of densities by defining $$ f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right) $$ where $\sigma > 0$ is a scale parameter and $\mu$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can instead use a location-scale family constructed as above. When $f_0$ is the standard normal density, this construction gives the $\text{N}(\mu, \sigma)$ family.
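To make the construction concrete, here is a minimal R sketch (the function name `loc_scale_density` is just for this illustration, not part of any package):

    # Build a location-scale density from a base density f0 centered at zero
    loc_scale_density <- function(x, mu, sigma, f0 = dnorm) {
      (1 / sigma) * f0((x - mu) / sigma)
    }

    # With the standard normal as f0 this reproduces the N(mu, sigma) density
    loc_scale_density(1.3, mu = 2, sigma = 0.5)                 # equals dnorm(1.3, 2, 0.5)
    # Other choices of f0 give other families, e.g. the Cauchy location-scale family
    loc_scale_density(1.3, mu = 2, sigma = 0.5, f0 = dcauchy)   # equals dcauchy(1.3, 2, 0.5)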

Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimator in the model $$ Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d.} $$ with the density $$ f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R}, $$ the Cauchy distribution with location parameter $\mu$ (so this is a location family). But first some notation. The weighted least squares estimator of $\mu$ is given by $$ \mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i} {\sum_{i=1}^n w_i}, $$ where the $w_i$ are some weights. We will see that the ML estimator of $\mu$ can be expressed in the same form, with $w_i$ some function of the residuals $$ \epsilon_i = y_i-\hat{\mu}. $$ The likelihood function is given by $$ L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2} $$ and the loglikelihood function is given by $$ l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right). $$ Its derivative with respect to $\mu$ is $$ \begin{eqnarray} \frac{\partial l(y)}{\partial \mu}&=& 0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\ &=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\ &=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber \end{eqnarray} $$ where $\epsilon_i=y_i-\mu$. Writing $f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$, so that $f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$, we get $$ \frac{f_0'(\epsilon)}{f_0(\epsilon)} = \frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}} {\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}. $$ We find $$ \begin{eqnarray} \frac {\partial l(y)} {\partial \mu} & =& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \nonumber \\ &=& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right) \cdot (-\epsilon_i) \nonumber \\ &=& \sum w_i \epsilon_i \nonumber \end{eqnarray} $$ where we used the definition $$ w_i= \frac{f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right) = \frac{-2 \epsilon_i} {1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right) = \frac{2}{1+\epsilon_i^2}. $$ Remembering that $\epsilon_i=y_i-\mu$ and setting this derivative to zero, we obtain the equation $$ \sum w_i y_i = \mu \sum w_i, $$ which is the estimating equation of IRLS. Note that

  1. The weights $w_i$ are always positive.
  2. If the residual is large, we give less weight to the corresponding observation.

To calculate the ML estimate in practice we need a starting value $\hat{\mu}^{(0)}$; we could use the median, for example. Using this value we calculate residuals $$ \epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)} $$ and weights $$ w_i^{(0)} = \frac{2}{1+(\epsilon_i^{(0)})^2 }. $$ The new value of $\hat{\mu}$ is given by $$ \hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i} {\sum w_i^{(0)} }. $$ Continuing in this way we define $$ \epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)} $$ and $$ w_i^{(j)} = \frac{2}{1+(\epsilon_i^{(j)})^2 }. $$ The estimated value at pass $j+1$ of the algorithm is then $$ \hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i} {\sum w_i^{(j)} }. $$ We continue until the sequence $$ \hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots $$ converges.
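As an illustration, here is a minimal R sketch of this iteration for the Cauchy location model (the function name `irls_cauchy`, the starting value and the tolerance are my own choices for the example):

    # IRLS for the Cauchy location model: weights w_i = 2 / (1 + eps_i^2)
    irls_cauchy <- function(y, mu = median(y), tol = 1e-8, maxit = 100) {
      for (j in seq_len(maxit)) {
        eps <- y - mu                  # residuals at the current estimate
        w   <- 2 / (1 + eps^2)         # large residuals get small weight
        mu_new <- sum(w * y) / sum(w)  # weighted least squares update
        if (abs(mu_new - mu) < tol) break
        mu <- mu_new
      }
      mu_new
    }

    irls_cauchy(c(-5, -1, 0, 1, 5))    # converges to the Cauchy ML estimate of mu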

Now we study this process with a more general location-scale family, $f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$, in less detail. Let $Y_1,Y_2,\ldots,Y_n$ be independent with the density above, and define $ \epsilon_i=\frac{y_i-\mu}{\sigma}$. The loglikelihood function is $$ l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log\left(f_0\left(\frac{y_i-\mu}{\sigma}\right)\right). $$ Writing $\nu=\sigma^2$, note that $$ \frac{\partial \epsilon_i}{\partial \mu} = -\frac{1}{\sigma} $$ and $$ \frac{\partial \epsilon_i}{\partial \nu} = (y_i-\mu)\left(\frac{1}{\sqrt{\nu}}\right)' = (y_i-\mu)\cdot \frac{-1}{2 \sigma^3}. $$ Calculating the loglikelihood derivative $$ \frac{\partial l(y)}{\partial \mu} = \sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \frac{\partial \epsilon_i}{\partial \mu} = \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)= -\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) = \frac{1}{\sigma}\sum w_i \epsilon_i $$ and equating this to zero gives the same estimating equation as in the first example. Then, searching for an estimator of $\sigma^2$: $$ \begin{eqnarray} \frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} + \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \frac{\partial \epsilon_i}{\partial\nu} \nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)} \cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2} \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu} \sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \left(-\frac{1}{\epsilon_i}\right) (-\epsilon_i)\cdot\epsilon_i\nonumber \\ &=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2 \stackrel{!}{=} 0, \nonumber \end{eqnarray} $$ leading to the estimator $$ \hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2. $$ The iterative algorithm above can be used in this case as well.
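A minimal sketch of this extended iteration, again taking the Cauchy as the base density $f_0$ (so the weights are $2/(1+\epsilon_i^2)$ as before; the function name and starting values are my own choices for the illustration):

    # IRLS for a Cauchy location-scale model: update mu and sigma^2 in each pass
    irls_cauchy_ls <- function(y, mu = median(y), sigma2 = mad(y)^2,
                               tol = 1e-8, maxit = 200) {
      for (j in seq_len(maxit)) {
        eps <- (y - mu) / sqrt(sigma2)          # standardized residuals
        w   <- 2 / (1 + eps^2)                  # w_i = f0'(eps_i)/f0(eps_i) * (-1/eps_i)
        mu_new     <- sum(w * y) / sum(w)       # estimating equation for mu
        sigma2_new <- mean(w * (y - mu_new)^2)  # sigma^2 = (1/n) sum w_i (y_i - mu)^2
        if (abs(mu_new - mu) < tol && abs(sigma2_new - sigma2) < tol) break
        mu <- mu_new
        sigma2 <- sigma2_new
      }
      list(mu = mu_new, sigma2 = sigma2_new)
    }

    irls_cauchy_ls(c(-5, -1, 0, 1, 5))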

In the following we give a numerical example using R, for the double exponential model (with known scale) and with data y <- c(-5,-1,0,1,5). For the double exponential density $f_0(\epsilon) \propto e^{-|\epsilon|}$ the weights work out to $w_i = 1/|\epsilon_i|$. For these data the ML estimate is 0 (the sample median). The initial value will be mu <- 0.5. One pass of the algorithm is

      iterest <- function(y, mu) {
                   w <- 1/abs(y - mu)   # double exponential weights: w_i = 1/|y_i - mu|
                   weighted.mean(y, w)  # weighted least squares update for mu
                   }

With this function you can experiment with doing the iterations "by hand". The full iterative algorithm can then be run with

    mu_0 <- 0.5
    repeat {
      mu <- iterest(y, mu_0)
      if (abs(mu_0 - mu) < 0.000001) break   # stop when successive estimates agree
      mu_0 <- mu
    }

Exercise: If the model is a $t_k$ distribution with scale parameter $\sigma$, show that the iterations use the weights $$ w_i = \frac{k + 1}{k + \epsilon_i^2}. $$

Exercise: If the density is logistic, show that the weights are given by $$ w(\epsilon) = \frac{1-e^\epsilon}{1+e^\epsilon} \cdot \left(-\frac{1}{\epsilon}\right). $$
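If you want to check these exercises numerically, the same iteration works with any weight function. A small sketch (the helper names `iterest_w` and `t_weights` are just for this illustration):

    # One pass of the iteration with an arbitrary weight function
    iterest_w <- function(y, mu, wfun) {
      w <- wfun(y - mu)
      weighted.mean(y, w)
    }

    # t_k weights from the first exercise: w(eps) = (k + 1) / (k + eps^2)
    t_weights <- function(k) function(eps) (k + 1) / (k + eps^2)

    y <- c(-5, -1, 0, 1, 5)
    iterest_w(y, 0.5, t_weights(3))   # one pass with t_3 weights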

For the moment I will leave it here; I will continue this post later.
