When I first learned about Generalized Linear Models, I thought the assumption that the dependent variable follows some distribution from the exponential family was made to simplify calculations. However, I have now read about Vector GLMs (VGLMs). VGLMs do not require the dependent variable to follow a distribution from the exponential family; instead, they allow a much broader set of distributions.
So my question is: WHY do we actually need the distribution assumption in GLMs?
My thoughts so far: GLMs model the mean of the assumed exponential-family distribution and thus have only one linear predictor (this predictor may be vector-valued in case of a vector-valued distribution mean). The variance of the distribution depends on the mean through some function, and the first two moments specify the distribution uniquely within the exponential family. Thus, it is enough to specify the link function to uniquely specify the distribution. VGLMs, on the other hand, allow more than one linear predictor, one for each parameter. It is therefore possible to specify the distribution by first assuming a distribution for the dependent variable and then estimating its parameters. Consider for instance the negative binomial distribution $NB(r,\mu)$. The two parameters are $r$ (the number of trials) and the mean $\mu$ (note that in this parametrization $p=\frac{\mu}{\mu+r}$). Can someone verify these thoughts or give another explanation?
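As a quick numerical check of this parametrization (a sketch assuming SciPy's `nbinom`, whose "success probability" parameter equals $r/(\mu+r) = 1-p$ under the convention above):

```python
import numpy as np
from scipy.stats import nbinom

r, mu = 5.0, 3.0
p = mu / (mu + r)        # the question's p = mu/(mu+r)

# scipy parametrizes NB(n, p_success) with mean n*(1-p_success)/p_success,
# so the success probability here is r/(mu+r) = 1 - p
dist = nbinom(r, 1 - p)

print(dist.mean())       # equals mu = 3.0
print(dist.var())        # equals mu + mu**2/r = 4.8
```

The variance $\mu + \mu^2/r$ exceeds the mean, which is why this parametrization is popular for overdispersed counts.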
Best Answer
When I discovered GLMs I also wondered why they were always based on the exponential family. I have never found a fully clear answer to that question. But...
Let $h$ denote the inverse of the link function and $\beta$ the parameter vector.
Yes. I used it with stochastic gradient descent (SGD), and the SGD update rule (the gradient) is especially simple in the canonical GLM case. See http://proceedings.mlr.press/v32/toulis14.pdf, Proposition 3.1 and Section 3.1. It all works in a way similar to least squares (minimize the average of $(Y-h(\beta X))^2$) but even simpler, and the interpretation of the update rule is straightforward: for a sample $(x,y)$, the correction applied to $\beta$ is just the error $y-h(\beta x)$ times $x$ (times the learning rate).
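A minimal sketch of this canonical-link SGD update, for logistic regression on synthetic data (the data, learning rate, and sample size are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(t):
    # inverse of the canonical (logit) link: the sigmoid
    return 1.0 / (1.0 + np.exp(-t))

# synthetic logistic data (assumed for illustration)
n, d = 5000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, h(X @ beta_true))

# canonical-link SGD: beta += gamma * (y - h(beta.x)) * x for each sample
beta = np.zeros(d)
gamma = 0.05
for x_i, y_i in zip(X, y):
    beta += gamma * (y_i - h(beta @ x_i)) * x_i

print(beta)  # roughly recovers beta_true
```

Note the update involves only the raw error and $x$; no derivative of $h$ appears, which is the simplification the canonical link buys.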
Without the exponential family and canonical link, the error would be multiplied by something dependent on $x$ (and perhaps $y$). That would be a sort of refinement of the basic idea: varying the intensity of the correction, giving different weights to the samples. With least squares, for instance, you have to multiply the error by $h'(\beta x)$. Some practical tests of mine on a large dataset showed this performed worse (for reasons I cannot explain).
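To make the $h'(\beta x)$ factor concrete, here is a sketch (with an arbitrary sample and sigmoid $h$, both assumptions for illustration) verifying by numerical differentiation that the per-sample squared-loss gradient carries that extra factor:

```python
import numpy as np

def h(t):                      # sigmoid, as inverse link
    return 1.0 / (1.0 + np.exp(-t))

def h_prime(t):
    s = h(t)
    return s * (1.0 - s)

x = np.array([0.5, -1.0, 2.0])
y = 1.0
beta = np.array([0.2, 0.1, -0.3])

# per-sample squared loss L(beta) = (y - h(beta.x))^2
def loss(b):
    return (y - h(b @ x)) ** 2

# analytic gradient: -2 * (y - h(beta.x)) * h'(beta.x) * x
# (the error is weighted by h'(beta.x), unlike the canonical GLM update)
grad_analytic = -2 * (y - h(beta @ x)) * h_prime(beta @ x) * x

# central-difference check
eps = 1e-6
grad_num = np.array([(loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.max(np.abs(grad_analytic - grad_num)))  # ~ 0: the factor is real
```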
Yes again.
Also, the pre-existing logistic and Poisson regressions fit into the canonical GLM framework. That is probably one more (historical) explanation for using the exponential family with the canonical link.
Maybe "why assume the exponential family in GLMs?" is similar to "why assume normal noise in linear regression?". For good theoretical properties and simple calculations... But does it always matter so much in practice? Real data rarely have normal noise, yet linear regression often still works very well.
What was fundamentally useful (for me) about GLMs is the difference from transformed linear regression: a GLM models $E(Y\mid x)=h(\beta x)$, whereas regressing the transformed response $g(Y)$ on $x$ models $E(g(Y)\mid x)=\beta x$, where $g=h^{-1}$ is the link. Since $g(E(Y))\neq E(g(Y))$ in general, this changes everything: the two approaches target different quantities and give different predictions.
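A minimal simulation of the gap (assuming a lognormal response, chosen only to make Jensen's inequality visible): back-transforming the mean of $\log Y$, as transformed regression does, systematically undershoots the mean of $Y$, which is what a log-link GLM targets.

```python
import numpy as np

rng = np.random.default_rng(1)
# lognormal Y: log Y ~ N(0, 1)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

# transformed regression fits on the log scale, then back-transforms:
backtransformed = np.exp(np.mean(np.log(y)))   # estimates exp(E[log Y]) = 1
# a GLM with log link targets E[Y] directly:
direct_mean = np.mean(y)                       # estimates E[Y] = exp(0.5)

print(backtransformed, direct_mean)  # the first is noticeably smaller
```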
I'm not familiar with VGLM so I can't answer about it.