Solved – “weight” input in glm and lm functions in R

Tags: generalized-linear-model, likelihood, lm, r, weighted-regression

I am confused with the definition of the weights in glm and lm.

Using the notation of McCullagh and Nelder (1989), if a random variable $y_i$ follows a Generalized Linear Model (GLM), its density is modelled in the form:

\begin{equation}
f(y_i) = \exp\Big(\frac{m_i}{\phi} [\theta_i y_i - b(\theta_i) ] + c(y_i;\phi)\Big)
\end{equation}

where $\theta_i$ is the canonical parameter, $\phi$ is the dispersion parameter and $m_i$ is the known prior "weight". I would like to know whether this prior "weight" is the weight specified in glm. help(glm) says that:

Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations. For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes: they would rarely be used for a Poisson GLM.
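The "mean of w_i unit-weight observations" reading in the help text can be checked numerically. The following is a minimal sketch on made-up data (the data frames `raw` and `agg` and the group sizes are assumptions for illustration, not part of the question): fitting `lm` to group means with the group sizes as weights should reproduce the coefficients of the fit to the raw observations.

```r
# Hypothetical data: three groups of unequal size.
set.seed(1)
raw <- data.frame(x = rep(c(1, 2, 3), times = c(2, 5, 3)))
raw$y <- 1 + 2 * raw$x + rnorm(nrow(raw))

# Collapse to one row per group: mean response and group size.
agg <- aggregate(y ~ x, data = raw, FUN = mean)
agg$n <- as.vector(table(raw$x))

fit_raw <- lm(y ~ x, data = raw)               # unit-weight observations
fit_agg <- lm(y ~ x, data = agg, weights = n)  # group means, weights = sizes

# The coefficient estimates agree. The residual variance estimates differ,
# because the aggregated fit has fewer residual degrees of freedom.
coef(fit_raw)
coef(fit_agg)
```

Only the point estimates are expected to match; standard errors are computed from different residual degrees of freedom in the two fits.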

Therefore, in my understanding, the "weight" $w_i$ re-parameterizes the dispersion parameter as

$$\phi=\frac{\phi^*}{w_i},$$

where $\phi^*$ is the redefined dispersion parameter.
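To spell out the step, here is a short worked substitution, assuming a unit prior weight in the density above and the re-parameterization $\phi=\phi^*/w_i$. The factor multiplying the bracketed term becomes

$$
\frac{1}{\phi}\big[\theta_i y_i - b(\theta_i)\big] = \frac{w_i}{\phi^*}\big[\theta_i y_i - b(\theta_i)\big],
$$

which has the same form as the McCullagh and Nelder density with prior weight $w_i$ and dispersion $\phi^*$. Under this reading, the weight passed to glm plays the role of the prior weight in the density.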
This means that, for example, when $y_i$ is modelled with only an intercept term $\beta_0$, the lm function with a non-NULL "weight" specification maximizes the sum of the weighted log-likelihoods of $y_i$ with respect to $\phi^*$ and $\beta_0$, where:

$$
f(y_i)=\sqrt{ \frac{w_i}{2\pi \phi^*} } \exp\Big(-\frac{1}{2}\frac{w_i (y_i-\beta_0)^2}{\phi^*}\Big),
$$
where the identity link is used, so $\beta_0=\theta_i$.
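A quick numerical check of the Gaussian case, as a sketch on made-up numbers (`y` and `w` below are hypothetical): maximizing the weighted density over $\beta_0$ gives the weighted mean, and that is what `lm` returns for an intercept-only model. Note that `lm` reports a variance estimate that divides by the residual degrees of freedom rather than by $n$, so it is not exactly the maximum likelihood estimate of $\phi^*$.

```r
# Hypothetical responses and weights, for illustration only.
y <- c(2.1, 3.4, 1.9, 4.2, 3.0)
w <- c(1, 2, 1, 4, 2)

fit <- lm(y ~ 1, weights = w)
b0  <- unname(coef(fit))

# The intercept equals the weighted mean of y.
all.equal(b0, weighted.mean(y, w))

# lm()'s residual variance is the weighted residual sum of squares divided
# by the residual degrees of freedom (n - 1 here), not by n.
phi_star <- sum(w * (y - b0)^2) / (length(y) - 1)
all.equal(summary(fit)$sigma^2, phi_star)
```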

Similarly, the glm function with family = "poisson" and a non-NULL "weight" maximizes the sum of the weighted log-likelihoods of $y_i$ with respect to $\beta_0$, where:

$$
f(y_i)=\frac{\beta_0^{w_i y_i}}{y_{i}!} \exp(-w_i \beta_0),
$$

where the log link is used, so $\beta_0=\exp(\theta_i)$.
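The Poisson expression can be checked the same way. Setting the derivative of the weighted log-likelihood to zero gives $\beta_0=\sum_i w_i y_i / \sum_i w_i$, and the sketch below (with hypothetical `y` and `w`) compares that value with the fitted mean from `glm`.

```r
# Hypothetical counts and weights, for illustration only.
y <- c(0, 3, 1, 5, 2)
w <- c(1, 2, 1, 4, 2)

fit <- glm(y ~ 1, family = poisson, weights = w)

# With the log link the fitted mean is exp(intercept); it should match
# the weighted mean of the counts up to IRLS convergence tolerance.
mu_hat <- exp(unname(coef(fit)))
mu_hat
sum(w * y) / sum(w)
```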

Similarly, the glm function with family = "binomial" and a non-NULL "weight" maximizes the sum of the weighted log-likelihoods of $y_i$ with respect to $\beta_0$, where:

$$
f(y_i)=
\begin{pmatrix}
m\\
y_i
\end{pmatrix}
\beta_0^{w_iy_i}(1-\beta_0)^{w_i(m-y_i)}
$$

where the logit link is used, so $\beta_0 = \operatorname{logit}^{-1}(\theta_i)$.
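For the binomial case there are two ways weights enter, and a small sketch may help separate them. The data below are hypothetical (`s` successes out of `m` trials, plus extra prior weights `w`). With a proportion response, `weights` carries the number of trials, as the help text says; with a two-column count response, additional weights multiply the trial counts, which matches the exponent $w_i$ in the expression above.

```r
# Hypothetical data, for illustration only.
s <- c(1, 3, 2, 5)      # successes
m <- c(10, 10, 5, 20)   # trials
w <- c(1, 2, 1, 3)      # extra prior weights

# Proportion response with the number of trials as weights ...
fit_prop <- glm(s / m ~ 1, family = binomial, weights = m)
# ... is the same model as the two-column (successes, failures) response.
fit_cnt  <- glm(cbind(s, m - s) ~ 1, family = binomial)
# Extra weights on top of the counts.
fit_w    <- glm(cbind(s, m - s) ~ 1, family = binomial, weights = w)

p_prop <- plogis(unname(coef(fit_prop)))  # compare with sum(s) / sum(m)
p_cnt  <- plogis(unname(coef(fit_cnt)))
p_w    <- plogis(unname(coef(fit_w)))     # compare with sum(w * s) / sum(w * m)
c(p_prop, p_cnt, p_w)
```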

Is my understanding correct?

Reference:

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1989.

Best Answer

I found a reference supporting my understanding of the weight in glm.

The book "Modern Applied Statistics with S" by W.N. Venables and B.D. Ripley (fourth edition) defines the GLM for $y_i$ as:

$$ f(y_i;\theta_i, \phi)=\exp \Big( \frac{A_i (y_i\theta_i-b(\theta_i))}{\phi}+c(y_i,\phi/A_i)\Big) $$

(page 183, equation 7.1). Then page 188 says:

"Prior weights $A_i$ may be specified using weight argument."

Related Solutions

Solved – R: glm function with family = “binomial” and “weight” specification

Your example is merely causing rounding error in R. Large weights don't perform well in glm. It's true that scaling w by virtually any smaller number, like 100, leads to the same estimates as the unscaled w.

If you want more reliable behavior with the weights arguments, try using the svyglm function from the survey package.

See here:

    > svyglm(Y~1, design=svydesign(ids=~1, weights=~w, data=data.frame(w=w*1000, Y=Y)), family=binomial)
    Independent Sampling design (with replacement)
    svydesign(ids = ~1, weights = ~w, data = data.frame(w = w * 1000, 
        Y = Y))

    Call:  svyglm(formula = Y ~ 1, design = svydesign(ids = ~1, weights = ~w2, 
        data = data.frame(w2 = w * 1000, Y = Y)), family = binomial)

    Coefficients:
    (Intercept)  
         -2.197  

    Degrees of Freedom: 3 Total (i.e. Null);  3 Residual
    Null Deviance:      2.601 
    Residual Deviance: 2.601    AIC: 2.843
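The claim that a moderate rescaling of the weights leaves the estimates unchanged can be checked directly with plain `glm`. This is a sketch on made-up data (`Y` and `w` below are hypothetical, not the data from the question), and it only covers a modest scale factor, not the very large weights discussed above.

```r
# Hypothetical binary responses and integer weights, for illustration only.
Y <- c(0, 0, 1, 0, 1, 0)
w <- c(3, 1, 2, 5, 1, 4)

fit1   <- glm(Y ~ 1, family = binomial, weights = w)
fit100 <- glm(Y ~ 1, family = binomial, weights = w * 100)

# The point estimates agree up to convergence tolerance ...
coef(fit1)
coef(fit100)

# ... but the model-based standard errors shrink by sqrt(100) = 10,
# because glm treats the weights as counts of observations.
se1   <- unname(sqrt(diag(vcov(fit1))))
se100 <- unname(sqrt(diag(vcov(fit100))))
se1 / se100
```

The change in standard errors is the practical reason to be careful about the scale of the weights: the coefficients are invariant, the inference is not.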

Solved – How many distributions are in the GLM

As you indicate, the qualification for using a distribution in a GLM is that it belong to the exponential family (this is not the same thing as the exponential distribution, although the exponential distribution, as a special case of the gamma distribution, is itself a member of the exponential family). The five distributions you list are all in this family and, more importantly, are very common distributions, so they are used as examples and explanations.

As Zhanxiong notes, the uniform distribution (with unknown bounds) is a classic example of a non-exponential family distribution. shf8888 is confusing the general uniform distribution, on any interval, with a Uniform(0, 1). The Uniform(0,1) distribution is a special case of the beta distribution, which is an exponential family. Other non-exponential family distributions are mixture models and the t distribution.

You have the definition of the exponential family correct, and the canonical parameter is very important for using GLM. Still, I've always found it somewhat easier to understand the exponential family by writing it as:

$$f(x; \theta) = a(\theta)g(x)\exp\left[b(\theta)R(x)\right]$$

There is a more general way to write this, with a vector $\boldsymbol{\theta}$ instead of a scalar $\theta$; but the one-dimensional case explains a lot. Specifically, you must be able to factor your density's non-exponentiated part into two functions, one of unknown parameter $\theta$ but not observed data $x$ and one of $x$ and not $\theta$; and the same for the exponentiated part. It may be hard to see how, e.g., the binomial distribution can be written this way; but with some algebraic juggling, it becomes clear eventually.
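As a worked instance of that juggling, take the binomial distribution with a known number of trials $n$ and unknown success probability $\theta$ (the symbols $n$ and $\theta$ here are chosen for illustration):

$$
f(x;\theta)=\binom{n}{x}\theta^{x}(1-\theta)^{n-x}
=(1-\theta)^{n}\,\binom{n}{x}\,\exp\left[x\,\log\frac{\theta}{1-\theta}\right].
$$

Matching terms with the form above gives $a(\theta)=(1-\theta)^n$, $g(x)=\binom{n}{x}$, $b(\theta)=\log\frac{\theta}{1-\theta}$ and $R(x)=x$. The function $b(\theta)$ is the logit, which is why the logit is the canonical link for binomial GLMs.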

We use the exponential family because it makes a lot of things much easier: for instance, finding sufficient statistics and testing hypotheses. In GLM, the canonical parameter is often used for finding a link function. Finally, a related illustration of why statisticians prefer to use the exponential family in just about every case is trying to do any classical statistical inference on, say, a Uniform($\theta_1$, $\theta_2$) distribution where both $\theta_1$ and $\theta_2$ are unknown. It's not impossible, but it's much more complicated and involved than doing the same for exponential family distributions.
