Some years ago I wrote a paper about this for my students (in Spanish), so I will try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $f_0$ be a density function centered at zero in some sense. We can construct a family of densities by defining
$$
f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right)
$$
where $\sigma > 0$ is a scale parameter and $\mu$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can replace that normal distribution with a location-scale family constructed as above. When $f_0$ is the standard normal density, the construction gives the $\text{N}(\mu, \sigma)$ family.
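To make this concrete, here is a small R sketch (the function name dls is mine) that builds a location-scale density from any standard density f0:
# Build a location-scale density from a standard density f0
dls <- function(x, mu, sigma, f0) {
  f0((x - mu) / sigma) / sigma
}
# With f0 = dnorm this reproduces the normal family:
all.equal(dls(1.3, mu = 2, sigma = 0.5, f0 = dnorm),
          dnorm(1.3, mean = 2, sd = 0.5))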
Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimators in the model
$$
Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d.}
$$
with the density
$$
f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R},
$$
the Cauchy distribution with location parameter $\mu$ (so this is a location family). But first some notation. The weighted least squares estimator of $\mu$ is given by
$$
\mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i}
{\sum_{i=1}^n w_i}.
$$
where the $w_i$ are weights. We will see that the ML estimator of $\mu$ can be expressed in the same form, with $w_i$ a function of the residuals
$$
\epsilon_i = y_i-\hat{\mu}.
$$
The likelihood function is given by
$$
L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2}
$$
and the loglikelihood function is given by
$$
l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right).
$$
Its derivative with respect to $\mu$ is
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \mu}&=&
0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\
&=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\
&=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber
\end{eqnarray}
$$
where $\epsilon_i=y_i-\mu$. Writing $f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$, so that $f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$, we get
$$
\frac{f_0'(\epsilon)}{f_0(\epsilon)} =
\frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}}
{\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}.
$$
We find
$$
\begin{eqnarray}
\frac {\partial l(y)} {\partial \mu}
& =& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \nonumber \\
&=& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot
\left(-\frac{1}{\epsilon_i}\right)
\cdot (-\epsilon_i) \nonumber \\
&=& \sum w_i \epsilon_i \nonumber
\end{eqnarray}
$$
where we used the definition
$$
w_i= \frac{f_0'(\epsilon_i)}
{f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{-2 \epsilon_i}
{1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{2}{1+\epsilon_i^2}.
$$
Remembering that
$\epsilon_i=y_i-\mu$ we obtain the equation
$$
\sum w_i y_i = \mu \sum w_i,
$$
which is the estimating equation of IRLS. Note that
- The weights $w_i$ are always positive.
- If the residual is large, we give less weight to the corresponding observation.
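To see the downweighting concretely, here is a quick sketch of the Cauchy weight function (the name w_cauchy is mine):
# Cauchy IRLS weight as a function of the residual
w_cauchy <- function(eps) 2 / (1 + eps^2)
round(w_cauchy(c(0, 1, 2, 5, 10)), 4)
# 2.0000 1.0000 0.4000 0.0769 0.0198 -- large residuals get small weight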
To calculate the ML estimator in practice, we need a starting value $\hat{\mu}^{(0)}$; we could use the median, for example. Using this value we calculate the residuals
$$
\epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)}
$$
and weights
$$
w_i^{(0)} = \frac{2}{1+\left(\epsilon_i^{(0)}\right)^2}.
$$
The new value of $\hat{\mu}$ is given by
$$
\hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i}
{\sum w_i^{(0)} }.
$$
Continuing in this way we define
$$
\epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)}
$$ and
$$
w_i^{(j)} = \frac{2}{1+\left(\epsilon_i^{(j)}\right)^2}.
$$
The estimate at pass $j+1$ of the algorithm becomes
$$
\hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i}
{\sum w_i^{(j)} }.
$$
We continue until the sequence
$$
\hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots
$$
converges.
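As a minimal R sketch of this whole iteration for the Cauchy location model (the helper names cauchy_step and irls_cauchy are mine):
# One IRLS pass: weights from the current residuals, then a weighted mean
cauchy_step <- function(y, mu) {
  w <- 2 / (1 + (y - mu)^2)
  weighted.mean(y, w)
}
# Iterate from the median until successive estimates stabilize
irls_cauchy <- function(y, tol = 1e-8) {
  mu <- median(y)
  repeat {
    mu_new <- cauchy_step(y, mu)
    if (abs(mu_new - mu) < tol) return(mu_new)
    mu <- mu_new
  }
}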
Now we study this process, in less detail, for the general location-scale family $f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$.
Let $Y_1,Y_2,\ldots,Y_n$ be independent with the density above. Define also $ \epsilon_i=\frac{y_i-\mu}{\sigma}$. The loglikelihood function is
$$
l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log(f_0\left(\frac{y_i-\mu}{\sigma}\right)).
$$
Writing $\nu=\sigma^2$, note that
$$
\frac{\partial \epsilon_i}{\partial \mu} =
-\frac{1}{\sigma}
$$
and
$$
\frac{\partial \epsilon_i}{\partial \nu} =
(y_i-\mu)\,\frac{\partial}{\partial\nu}\left(\frac{1}{\sqrt{\nu}}\right) =
(y_i-\mu)\cdot \frac{-1}{2 \sigma^3}.
$$
Calculating the loglikelihood derivative
$$
\frac{\partial l(y)}{\partial \mu} =
\sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial \mu} =
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)=
-\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) =
\frac{1}{\sigma}\sum w_i \epsilon_i
$$
and setting this equal to zero gives the same estimating equation as in the first example. Next we search for an estimator of $\sigma^2$:
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} +
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial\nu} \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}
\cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)
(-\epsilon_i)\cdot\epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2
\stackrel{!}{=} 0. \nonumber
\end{eqnarray}
$$
leading to the estimator
$$
\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2.
$$
The iterative algorithm above can be used in this case as well.
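A hedged sketch of one joint pass for this case, using the Cauchy weights derived above (the function name cauchy_ls_step is mine); iterating it as before updates mu and sigma^2 together:
# One joint IRLS pass for the Cauchy location-scale model
cauchy_ls_step <- function(y, mu, sigma2) {
  eps <- (y - mu) / sqrt(sigma2)          # standardized residuals
  w <- 2 / (1 + eps^2)                    # Cauchy weights
  mu_new <- weighted.mean(y, w)           # location update
  sigma2_new <- mean(w * (y - mu_new)^2)  # scale update
  c(mu = mu_new, sigma2 = sigma2_new)
}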
In the following we give a numerical example using R, for the double exponential model (with known scale) and with data y <- c(-5,-1,0,1,5). For these data the true value of the ML estimator is 0 (for the double exponential, the ML estimator of location is the median).
The initial value will be mu <- 0.5. One pass of the algorithm is
iterest <- function(y, mu) {
  # Double exponential weights: w = (f0'/f0) * (-1/eps) = 1/|eps|
  w <- 1/abs(y - mu)
  # Weighted least squares update for mu
  weighted.mean(y, w)
}
With this function you can experiment by doing the iterations "by hand".
Then the iterative algorithm can be run as
mu_0 <- 0.5
repeat {
  mu <- iterest(y, mu_0)
  if (abs(mu_0 - mu) < 0.000001) break
  mu_0 <- mu
}
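After convergence mu should be close to 0. One caveat (my addition): iterest divides by abs(y - mu), so an iterate landing exactly on an observation gives an infinite weight and a NaN update; a simple safeguard is to floor the residuals:
# Guarded version: floor tiny residuals to avoid division by zero
iterest_safe <- function(y, mu, floor = 1e-8) {
  w <- 1 / pmax(abs(y - mu), floor)
  weighted.mean(y, w)
}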
Exercise: If the model is a $t_k$ distribution with scale parameter $\sigma$, show that the iterations are given by the weights
$$
w_i = \frac{k + 1}{k + \epsilon_i^2}.
$$
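As a hedged numerical sanity check of this weight (taking $\sigma = 1$, an arbitrary $k$ and residual, and a finite-difference step h; all names here are mine):
# Compare (f0'/f0)*(-1/eps) computed numerically with (k+1)/(k+eps^2)
k <- 5; eps <- 1.7; h <- 1e-6
f0 <- function(e) dt(e, df = k)
num_w <- -(log(f0(eps + h)) - log(f0(eps - h))) / (2 * h) / eps
c(num_w, (k + 1) / (k + eps^2))   # the two values should agree closely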
Exercise: If the density is logistic, show that the weights are given by
$$
w(\epsilon) = -\frac{1-e^\epsilon}{1+e^\epsilon} \cdot \frac{1}{\epsilon}.
$$
For the moment I will leave it here; I will continue this post later.
Yes. GEE and GLM will indeed have the same coefficients, but different standard errors. To check, run an example in R. I've taken this example from Chapter 25 of Applied Regression Analysis and Other Multivariable Methods, 5th ed., by Kleinbaum et al. (just because it's on my desk and references GEE and GLM):
library(geepack)
library(lme4)
#Get the book data from the link below
mydf<-read.table("http://www.hmwu.idv.tw/web/bigdata/rstudio-readData/tab/ch25q04.txt", header=TRUE)
mydf<-data.frame(subj=mydf$subj, week=as.factor(mydf$week), fev=mydf$fev)
#Make 5th level the reference level to match book results
mydf$week<-relevel(mydf$week, ref="5")
#Fit GLM Mixed Model
mixed.model<-summary(lme4::lmer(fev~week+(1|subj),data=mydf))
mixed.model$coefficients
Estimate Std. Error t value
(Intercept) 6.99850 0.2590243 27.01870247
week1 2.81525 0.2439374 11.54087244
week2 -0.15025 0.2439374 -0.61593680
week3 0.00325 0.2439374 0.01332309
week4 -0.04700 0.2439374 -0.19267241
#Fit a gee model with any correlation structure. In this case AR1
gee.model<-summary(geeglm(fev~week, id=subj, waves=week, corstr="ar1", data=mydf))
gee.model$coefficients
Estimate Std.err Wald Pr(>|W|)
(Intercept) 6.99850 0.2418413 8.374312e+02 0.0000000
week1 2.81525 0.2514376 1.253642e+02 0.0000000
week2 -0.15025 0.2051973 5.361492e-01 0.4640330
week3 0.00325 0.2075914 2.451027e-04 0.9875090
week4 -0.04700 0.2388983 3.870522e-02 0.8440338
UPDATE
As Mark White pointed out in his comment, I did indeed previously fit a "single-level" Mixed Effects GLM. Since you didn't specify whether you wanted a "fixed effects" or "random effects" GLM model, I just picked "random" since that's the model fit in the book I selected from. But indeed, Mark is right that the coefficients do not necessarily agree in multilevel models, and someone provided a nice answer about that question previously. For your reference, I've added a "fixed" effects GLM model below using lm.
#Fit Traditional GLM Fixed Effect Model (i.e. not Random effects)
glm.fixed<-summary(lm(fev~week, data=mydf))
glm.fixed$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.99850 0.2590243 27.01870247 7.696137e-68
week1 2.81525 0.3663157 7.68531179 7.287752e-13
week2 -0.15025 0.3663157 -0.41016538 6.821349e-01
week3 0.00325 0.3663157 0.00887213 9.929302e-01
week4 -0.04700 0.3663157 -0.12830465 8.980401e-01
Note the first and second columns of the output in each model. The coefficients are identical, but the standard errors differ.
You also added a comment which asked, "And does this remain the case when we choose a non-linear link function?" Note first that this is a different question, since models with non-linear link functions generally aren't General Linear Models but Generalized Linear Models. In this case, the coefficients do not necessarily match. Here's an example, again in R:
#Fit Generalized Linear Mixed Effects Model with, say, a Binomial Link
nlmixed.model<-summary(lme4::glmer(I(mydf$fev>mean(mydf$fev))~week+(1|subj), family="binomial", data=mydf))
nlmixed.model$coefficients
#Fit GEE model with, say, Binomial Link
nlgee.model<-summary(geeglm(I(mydf$fev>mean(mydf$fev))~week, id=subj, waves=week, family="binomial", data=mydf))
nlgee.model$coefficients
Best Answer
Fit an MLE by maximizing $$ l(\mathbf{\theta};\mathbf{y})=\sum_{i=1}^Nl{\left(\theta;y_i\right)} $$
where $l$ is the log-likelihood. Fitting an MLE with inverse-probability (i.e. frequency) weights entails modifying the log-likelihood to:
$$ l(\mathbf{\theta};\mathbf{y})=\sum_{i=1}^Nw_i~l{\left(\theta;y_i\right)}. $$
In the GLM case, this reduces to solving $$ \sum_{i=1}^N w_i\frac{y_i-\mu_i}{V(y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}x_{ij}\right)=0,~\forall j $$
Source: page 119 of http://www.ssicentral.com/lisrel/techdocs/sglim.pdf, linked at http://www.ssicentral.com/lisrel/resources.html#t. It's the "Generalized Linear Modeling" chapter (chapter 3) of the LISREL "technical documents."
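In R, glm solves this weighted score equation through its weights argument; as a quick sketch with toy data of my own, frequency weights reproduce the fit on the expanded data:
# Frequency-weighted fit vs. the same cases physically repeated
y <- c(0, 1, 0, 1, 1)
x <- c(1, 2, 3, 4, 5)
w <- c(2, 1, 1, 3, 1)
fit_w <- glm(y ~ x, family = binomial, weights = w)
fit_e <- glm(rep(y, w) ~ rep(x, w), family = binomial)
all.equal(unname(coef(fit_w)), unname(coef(fit_e)))  # coefficients match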