Some years ago I wrote a paper about this for my students (in Spanish), so I will try to rewrite those explanations here. I will look at IRLS (iteratively reweighted least squares) through a series of examples of increasing complexity. For the first example we need the concept of a location-scale family. Let $f_0$ be a density function centered at zero in some sense. We can construct a family of densities by defining
$$
f(x)= f(x;\mu,\sigma)= \frac{1}{\sigma} f_0\left(\frac{x-\mu}{\sigma}\right)
$$
where $\sigma > 0$ is a scale parameter and $\mu$ is a location parameter. In the measurement error model, where the error term is usually modeled as a normal distribution, we can replace that normal distribution with a location-scale family constructed as above. When $f_0$ is the standard normal density, the construction gives the $\text{N}(\mu, \sigma)$ family.
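To make this concrete, here is a small R sketch (the function name dls is mine) that builds a location-scale density from any standard density f0:
# Build a location-scale density from a standard density f0
dls <- function(x, mu, sigma, f0) {
  f0((x - mu) / sigma) / sigma
}
# With f0 = dnorm this reproduces the normal family:
all.equal(dls(1.3, mu = 2, sigma = 0.5, f0 = dnorm),
          dnorm(1.3, mean = 2, sd = 0.5))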
Now we will use IRLS on some simple examples. First we will find the ML (maximum likelihood) estimators in the model
$$
Y_1,Y_2,\ldots,Y_n \hspace{1em} \text{i.i.d.}
$$
with the density
$$
f(y)= \frac{1}{\pi} \frac{1}{1+(y-\mu)^2},\hspace{1em} y\in{\mathbb R},
$$
the Cauchy distribution with location parameter $\mu$ (so this is a location family). But first some notation. The weighted least squares estimator of $\mu$ is given by
$$
\mu^{\ast} = \frac{\sum_{i=1}^n w_i y_i}
{\sum_{i=1}^n w_i}.
$$
where the $w_i$ are weights. We will see that the ML estimator of $\mu$ can be expressed in the same form, with $w_i$ a function of the residuals
$$
\epsilon_i = y_i-\hat{\mu}.
$$
The likelihood function is given by
$$
L(y;\mu)= \left(\frac{1}{\pi}\right)^n \prod_{i=1}^n \frac{1}{1+(y_i-\mu)^2}
$$
and the loglikelihood function is given by
$$
l(y)= -n \log(\pi) - \sum_{i=1}^n \log\left(1+(y_i-\mu)^2\right).
$$
Its derivative with respect to $\mu$ is
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \mu}&=&
0-\sum \frac{\partial}{\partial \mu} \log\left(1+(y_i-\mu)^2\right) \nonumber \\
&=& -\sum \frac{2(y_i-\mu)}{1+(y_i-\mu)^2}\cdot (-1) \nonumber \\
&=& \sum \frac{2 \epsilon_i}{1+\epsilon_i^2} \nonumber
\end{eqnarray}
$$
where $\epsilon_i=y_i-\mu$. Writing $f_0(\epsilon)= \frac{1}{\pi} \frac{1}{1+\epsilon^2}$, so that $f_0'(\epsilon)=\frac{1}{\pi} \frac{-2 \epsilon}{(1+\epsilon^2)^2}$, we get
$$
\frac{f_0'(\epsilon)}{f_0(\epsilon)} =
\frac{\frac{-2\epsilon}{(1+\epsilon^2)^2}}
{\frac{1}{1+\epsilon^2}} = -\frac{2\epsilon}{1+\epsilon^2}.
$$
We find
$$
\begin{eqnarray}
\frac {\partial l(y)} {\partial \mu}
& =& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \nonumber \\
&=& -\sum \frac {f_0'(\epsilon_i)} {f_0(\epsilon_i)} \cdot
\left(-\frac{1}{\epsilon_i}\right)
\cdot (-\epsilon_i) \nonumber \\
&=& \sum w_i \epsilon_i \nonumber
\end{eqnarray}
$$
where we used the definition
$$
w_i= \frac{f_0'(\epsilon_i)}
{f_0(\epsilon_i)} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{-2 \epsilon_i}
{1+\epsilon_i^2} \cdot \left(-\frac{1}{\epsilon_i}\right)
= \frac{2}{1+\epsilon_i^2}.
$$
Remembering that
$\epsilon_i=y_i-\mu$ we obtain the equation
$$
\sum w_i y_i = \mu \sum w_i,
$$
which is the estimating equation of IRLS. Note that
- The weights $w_i$ are always positive.
- If the residual is large, we give less weight to the corresponding observation.
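To see the downweighting concretely, here is a quick sketch of the Cauchy weight function (the name w_cauchy is mine):
# Cauchy IRLS weight as a function of the residual
w_cauchy <- function(eps) 2 / (1 + eps^2)
round(w_cauchy(c(0, 1, 2, 5, 10)), 4)
# 2.0000 1.0000 0.4000 0.0769 0.0198 -- large residuals get small weight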
To calculate the ML estimator in practice, we need a starting value $\hat{\mu}^{(0)}$; we could use the median, for example. Using this value we calculate the residuals
$$
\epsilon_i^{(0)} = y_i - \hat{\mu}^{(0)}
$$
and weights
$$
w_i^{(0)} = \frac{2}{1+\left(\epsilon_i^{(0)}\right)^2}.
$$
The new value of $\hat{\mu}$ is given by
$$
\hat{\mu}^{(1)} = \frac{\sum w_i^{(0)} y_i}
{\sum w_i^{(0)} }.
$$
Continuing in this way we define
$$
\epsilon_i^{(j)} = y_i- \hat{\mu}^{(j)}
$$ and
$$
w_i^{(j)} = \frac{2}{1+\left(\epsilon_i^{(j)}\right)^2}.
$$
The estimate at pass $j+1$ of the algorithm becomes
$$
\hat{\mu}^{(j+1)} = \frac{\sum w_i^{(j)} y_i}
{\sum w_i^{(j)} }.
$$
We continue until the sequence
$$
\hat{\mu}^{(0)}, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(j)}, \ldots
$$
converges.
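As a minimal R sketch of this whole iteration for the Cauchy location model (the helper names cauchy_step and irls_cauchy are mine):
# One IRLS pass: weights from the current residuals, then a weighted mean
cauchy_step <- function(y, mu) {
  w <- 2 / (1 + (y - mu)^2)
  weighted.mean(y, w)
}
# Iterate from the median until successive estimates stabilize
irls_cauchy <- function(y, tol = 1e-8) {
  mu <- median(y)
  repeat {
    mu_new <- cauchy_step(y, mu)
    if (abs(mu_new - mu) < tol) return(mu_new)
    mu <- mu_new
  }
}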
Now we study this process, in less detail, for the general location-scale family $f(y)= \frac{1}{\sigma} f_0\left(\frac{y-\mu}{\sigma}\right)$.
Let $Y_1,Y_2,\ldots,Y_n$ be independent with the density above. Define also $ \epsilon_i=\frac{y_i-\mu}{\sigma}$. The loglikelihood function is
$$
l(y)= -\frac{n}{2}\log(\sigma^2) + \sum \log(f_0\left(\frac{y_i-\mu}{\sigma}\right)).
$$
Writing $\nu=\sigma^2$, note that
$$
\frac{\partial \epsilon_i}{\partial \mu} =
-\frac{1}{\sigma}
$$
and
$$
\frac{\partial \epsilon_i}{\partial \nu} =
(y_i-\mu)\,\frac{\partial}{\partial\nu}\left(\frac{1}{\sqrt{\nu}}\right) =
(y_i-\mu)\cdot \frac{-1}{2 \sigma^3}.
$$
Calculating the loglikelihood derivative
$$
\frac{\partial l(y)}{\partial \mu} =
\sum \frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial \mu} =
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot\left(-\frac{1}{\sigma}\right)=
-\frac{1}{\sigma}\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)(-\epsilon_i) =
\frac{1}{\sigma}\sum w_i \epsilon_i
$$
and setting this equal to zero gives the same estimating equation as in the first example. Next we search for an estimator of $\sigma^2$:
$$
\begin{eqnarray}
\frac{\partial l(y)}{\partial \nu} &=& -\frac{n}{2}\frac{1}{\nu} +
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\frac{\partial \epsilon_i}{\partial\nu} \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}
\cdot \left(-\frac{(y_i-\mu)}{2\sigma^3}\right) \nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu} - \frac{1}{2}\frac{1}{\sigma^2}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot \epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}-\frac{1}{2}\frac{1}{\nu}
\sum\frac{f_0'(\epsilon_i)}{f_0(\epsilon_i)}\cdot
\left(-\frac{1}{\epsilon_i}\right)
(-\epsilon_i)\cdot\epsilon_i\nonumber \\
&=& -\frac{n}{2}\frac{1}{\nu}+\frac{1}{2}\frac{1}{\nu}\sum w_i \epsilon_i^2
\stackrel{!}{=} 0. \nonumber
\end{eqnarray}
$$
leading to the estimator
$$
\hat{\sigma}^2 = \frac{1}{n}\sum w_i (y_i-\hat{\mu})^2.
$$
The iterative algorithm above can be used in this case as well.
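A hedged sketch of one joint pass for this case, using the Cauchy weights derived above (the function name cauchy_ls_step is mine); iterating it as before updates mu and sigma^2 together:
# One joint IRLS pass for the Cauchy location-scale model
cauchy_ls_step <- function(y, mu, sigma2) {
  eps <- (y - mu) / sqrt(sigma2)          # standardized residuals
  w <- 2 / (1 + eps^2)                    # Cauchy weights
  mu_new <- weighted.mean(y, w)           # location update
  sigma2_new <- mean(w * (y - mu_new)^2)  # scale update
  c(mu = mu_new, sigma2 = sigma2_new)
}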
In the following we give a numerical example using R, for the double exponential model (with known scale) and with data y <- c(-5,-1,0,1,5). For these data the true value of the ML estimator is 0 (for the double exponential, the ML estimator of location is the median).
The initial value will be mu <- 0.5. One pass of the algorithm is
iterest <- function(y, mu) {
  # Double exponential weights: w = (f0'/f0) * (-1/eps) = 1/|eps|
  w <- 1/abs(y - mu)
  # Weighted least squares update for mu
  weighted.mean(y, w)
}
With this function you can experiment by doing the iterations "by hand".
Then the iterative algorithm can be run as
mu_0 <- 0.5
repeat {
  mu <- iterest(y, mu_0)
  if (abs(mu_0 - mu) < 0.000001) break
  mu_0 <- mu
}
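After convergence mu should be close to 0. One caveat (my addition): iterest divides by abs(y - mu), so an iterate landing exactly on an observation gives an infinite weight and a NaN update; a simple safeguard is to floor the residuals:
# Guarded version: floor tiny residuals to avoid division by zero
iterest_safe <- function(y, mu, floor = 1e-8) {
  w <- 1 / pmax(abs(y - mu), floor)
  weighted.mean(y, w)
}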
Exercise: If the model is a $t_k$ distribution with scale parameter $\sigma$, show that the iterations are given by the weights
$$
w_i = \frac{k + 1}{k + \epsilon_i^2}.
$$
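As a hedged numerical sanity check of this weight (taking $\sigma = 1$, an arbitrary $k$ and residual, and a finite-difference step h; all names here are mine):
# Compare (f0'/f0)*(-1/eps) computed numerically with (k+1)/(k+eps^2)
k <- 5; eps <- 1.7; h <- 1e-6
f0 <- function(e) dt(e, df = k)
num_w <- -(log(f0(eps + h)) - log(f0(eps - h))) / (2 * h) / eps
c(num_w, (k + 1) / (k + eps^2))   # the two values should agree closely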
Exercise: If the density is logistic, show that the weights are given by
$$
w(\epsilon) = -\frac{1-e^\epsilon}{1+e^\epsilon} \cdot \frac{1}{\epsilon}.
$$
For the moment I will leave it here; I will continue this post later.
Yes. GEE and GLM will indeed have the same coefficients, but different standard errors. To check, run an example in R. I've taken this example from Chapter 25 of Applied Regression Analysis and Other Multivariable Methods, 5th ed., by Kleinbaum et al. (just because it's on my desk and references GEE and GLM):
library(geepack)
library(lme4)
#Get the book data from the link below
mydf<-read.table("http://www.hmwu.idv.tw/web/bigdata/rstudio-readData/tab/ch25q04.txt", header=TRUE)
mydf<-data.frame(subj=mydf$subj, week=as.factor(mydf$week), fev=mydf$fev)
#Make 5th level the reference level to match book results
mydf$week<-relevel(mydf$week, ref="5")
#Fit GLM Mixed Model
mixed.model<-summary(lme4::lmer(fev~week+(1|subj),data=mydf))
mixed.model$coefficients
Estimate Std. Error t value
(Intercept) 6.99850 0.2590243 27.01870247
week1 2.81525 0.2439374 11.54087244
week2 -0.15025 0.2439374 -0.61593680
week3 0.00325 0.2439374 0.01332309
week4 -0.04700 0.2439374 -0.19267241
#Fit a gee model with any correlation structure. In this case AR1
gee.model<-summary(geeglm(fev~week, id=subj, waves=week, corstr="ar1", data=mydf))
gee.model$coefficients
Estimate Std.err Wald Pr(>|W|)
(Intercept) 6.99850 0.2418413 8.374312e+02 0.0000000
week1 2.81525 0.2514376 1.253642e+02 0.0000000
week2 -0.15025 0.2051973 5.361492e-01 0.4640330
week3 0.00325 0.2075914 2.451027e-04 0.9875090
week4 -0.04700 0.2388983 3.870522e-02 0.8440338
UPDATE
As Mark White pointed out in his comment, I did indeed previously fit a "single-level" Mixed Effects GLM. Since you didn't specify whether you wanted a "fixed effects" or "random effects" GLM model, I just picked "random" since that's the model fit in the book I selected from. But indeed, Mark is right that the coefficients do not necessarily agree in multilevel models, and someone provided a nice answer about that question previously. For your reference, I've added a "fixed" effects GLM model below using lm.
#Fit Traditional GLM Fixed Effect Model (i.e. not Random effects)
glm.fixed<-summary(lm(fev~week, data=mydf))
glm.fixed$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.99850 0.2590243 27.01870247 7.696137e-68
week1 2.81525 0.3663157 7.68531179 7.287752e-13
week2 -0.15025 0.3663157 -0.41016538 6.821349e-01
week3 0.00325 0.3663157 0.00887213 9.929302e-01
week4 -0.04700 0.3663157 -0.12830465 8.980401e-01
Note the first and second columns of the output in each model. The coefficients are identical, but the standard errors differ.
You also added a comment which asked, "And does this remain the case when we choose a non-linear link function?" Note first that this is a different question, since models with non-linear link functions generally aren't General Linear Models but Generalized Linear Models. In this case, the coefficients do not necessarily match. Here's an example, again in R:
#Fit Generalized Linear Mixed Effects Model with, say, a Binomial Link
nlmixed.model<-summary(lme4::glmer(I(mydf$fev>mean(mydf$fev))~week+(1|subj), family="binomial", data=mydf))
nlmixed.model$coefficients
#Fit GEE model with, say, Binomial Link
nlgee.model<-summary(geeglm(I(mydf$fev>mean(mydf$fev))~week, id=subj, waves=week, family="binomial", data=mydf))
nlgee.model$coefficients
Best Answer
Fit an MLE by maximizing $$ l(\mathbf{\theta};\mathbf{y})=\sum_{i=1}^Nl{\left(\theta;y_i\right)} $$
where $l$ is the log-likelihood. Fitting an MLE with inverse-probability (i.e. frequency) weights entails modifying the log-likelihood to:
$$ l(\mathbf{\theta};\mathbf{y})=\sum_{i=1}^Nw_i~l{\left(\theta;y_i\right)}. $$
In the GLM case, this reduces to solving $$ \sum_{i=1}^N w_i\frac{y_i-\mu_i}{V(y_i)}\left(\frac{\partial\mu_i}{\partial\eta_i}x_{ij}\right)=0,~\forall j $$
Source: page 119 of http://www.ssicentral.com/lisrel/techdocs/sglim.pdf, linked at http://www.ssicentral.com/lisrel/resources.html#t. It's the "Generalized Linear Modeling" chapter (chapter 3) of the LISREL "technical documents."
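In R, glm solves this weighted score equation through its weights argument; as a quick sketch with toy data of my own, frequency weights reproduce the fit on the expanded data:
# Frequency-weighted fit vs. the same cases physically repeated
y <- c(0, 1, 0, 1, 1)
x <- c(1, 2, 3, 4, 5)
w <- c(2, 1, 1, 3, 1)
fit_w <- glm(y ~ x, family = binomial, weights = w)
fit_e <- glm(rep(y, w) ~ rep(x, w), family = binomial)
all.equal(unname(coef(fit_w)), unname(coef(fit_e)))  # coefficients match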