Let the density of the response distribution $y_i \mid x_i$ in GLMs be written as:

$$f(y; \theta, \phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y; \phi)\right)$$

Then the conditional expectation and variance are given by $E[y_i | x_i] = \mu_i = b'(\theta_i)$ and $\text{Var}[y_i | x_i] = \phi \cdot b''(\theta_i)$. Furthermore, the variance function is given by $V[\mu] = b''(\theta(\mu))$.

I checked this with the Poisson and Bernoulli distributions and it really seems to hold. But I am puzzled about where those derivatives come from and why this should hold.
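For reference, here is the kind of check I mean for the Poisson case (a small sketch; the parametrisation $\theta = \log\lambda$, $b(\theta) = e^\theta$, $\phi = 1$ is the standard one, so both $b'(\theta)$ and $b''(\theta)$ should equal $\lambda$):

```python
from math import exp, log, factorial

# Poisson in exponential-family form: f(y) = exp(y*theta - b(theta) + c(y))
# with theta = log(lam), b(theta) = exp(theta), c(y) = -log(y!).
lam = 3.0
theta = log(lam)

def pmf(y):
    return exp(y * theta - exp(theta) - log(factorial(y)))

ys = range(200)  # the tail beyond 200 is numerically negligible for lam = 3
mean = sum(y * pmf(y) for y in ys)
var = sum(y**2 * pmf(y) for y in ys) - mean**2

print(mean, var)  # both close to lam = 3.0, matching b'(theta) = b''(theta) = exp(theta)
```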

Any input is welcome!

## Best Answer

I have always wondered about this myself. There are proofs, such as the one using the score in the other answer, or via the moment generating function as mentioned in the comments. But they are indirect, and it is not immediately intuitive that the expectation equals the derivative of the log normalisation factor $b(\theta)$. I hope this post helps show more directly why the mean and variance are related to the normalisation factor.

## Preliminary definitions

We can simplify the question by taking $\phi = 1$ (without loss of generality). Then

$$f(y; \theta) \propto e^{\theta y} h(y)$$

or with a normalisation factor

$$f(y; \theta) = \frac{ e^{\theta y} h(y)}{\int_{\Omega_Y} e^{\theta y} h(y) \, \text{d}y} = \frac{ e^{\theta y} h(y)}{z(\theta)} $$

where $z(\theta) = e^{b(\theta)}$ and $h(y) = e^{c(y)}$ are the normalisation factor and the base measure, taken outside of the exponential function. Also note that this base measure $h$ is in fact the same as the unnormalised density when $\theta = 0$ (then the exponential factor equals 1: $e^{0 \cdot y} = 1$ for all values of $y$).
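A minimal numerical sketch of the relation $z(\theta) = e^{b(\theta)}$, using the Bernoulli case (where $y \in \{0, 1\}$, $h(y) = 1$, and $b(\theta) = \log(1 + e^\theta)$ is the standard log-partition function):

```python
from math import exp, log

# Bernoulli: summing the unnormalised density over the support {0, 1}
# gives z(theta) = 1 + exp(theta), which should equal exp(b(theta)).
theta = 0.7
z = sum(exp(theta * y) for y in (0, 1))
b = log(1 + exp(theta))
print(z, exp(b))  # identical: z(theta) = e^{b(theta)}
```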

## Intuitive viewpoint

The factor $e^{\theta y}$ can be regarded as an energy/temperature term as in the Boltzmann distribution. If $\theta$ is larger, then larger values of $y$ become relatively more probable.

For example, the density ratios scale with a factor $e^{\theta (y_1-y_2)}$:

$$\frac{f(y_1;\theta)}{f(y_2;\theta)} = \frac{f(y_1;0)}{f(y_2;0)} e^{\theta (y_1-y_2)}$$
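This scaling identity can be verified numerically; here is a sketch using the Poisson family as the example (so $h(y) \propto 1/y!$, chosen purely for illustration):

```python
from math import exp, log, factorial

# Poisson density in exponential-family form, for a given theta.
def f(y, theta):
    return exp(y * theta - exp(theta) - log(factorial(y)))

# Check: f(y1; theta) / f(y2; theta) = [f(y1; 0) / f(y2; 0)] * exp(theta*(y1 - y2))
y1, y2, theta = 5, 2, 0.8
lhs = f(y1, theta) / f(y2, theta)
rhs = (f(y1, 0) / f(y2, 0)) * exp(theta * (y1 - y2))
print(lhs, rhs)  # equal up to floating-point error
```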

The graphs below might show this more intuitively.

The first image uses three different truncated exponential distributions (different shades) with different domains: $[-2.5, -1.5]$, $[0, 1]$ and $[6, 7]$.

The effect of the term $e^{\theta y}$ is shown in the second image. It changes the area under the function, and the amount of change depends on the values of $y$. If we increase $\theta$, the area decreases where $y$ is negative and increases where $y$ is positive. The decrease/increase is stronger the further $y$ is from zero.

So, in a way, the change of the area as we change the parameter $\theta$ is related to where the probability density is located. That is why the normalisation factor $z(\theta)$, or its logarithm $b(\theta)$, is related to the mean (and the variance).

## Exact computation

The constant of proportionality is found by integrating (or summing) over the range of the variable $Y$:

$$e^{b(\theta)} = z(\theta) = \int_{\Omega_Y} e^{\theta y} h(y) dy$$

and by differentiating under the integral sign,

$$\begin{array}{rcl} z'(\theta) &=& \int_{\Omega_Y} y e^{\theta y} h(y) dy\\ z''(\theta) &=& \int_{\Omega_Y} y^2 e^{\theta y} h(y) dy \end{array}$$

With the above three functions we can express the mean and variance.

$$\begin{array}{rcl} E[y] &=& \frac{\int_{\Omega_Y} y e^{\theta y} h(y) dy}{\int_{\Omega_Y} e^{\theta y} h(y) dy} \\ &=& \frac{z'(\theta)}{z(\theta)}\\ &=& \left[\log \, z(\theta)\right]'\\ &=& b'(\theta) \\ \phantom{1} \\ \text{Var}[y] &=& \frac{\int_{\Omega_Y} y^2 e^{\theta y} h(y) dy}{\int_{\Omega_Y} e^{\theta y} h(y) dy} - \left( \frac{\int_{\Omega_Y} y e^{\theta y} h(y) dy}{\int_{\Omega_Y} e^{\theta y} h(y) dy} \right)^2 \\ &=& \frac{z''(\theta)}{z(\theta)} - \left( \frac{z'(\theta)}{z(\theta)} \right)^2\\ &=& \left[\log \, z(\theta)\right]''\\ &=& b''(\theta) \end{array}$$
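These identities can also be checked numerically. A sketch using the tilted uniform base measure $h(y) = 1$ on $[0, 1]$ (one of the truncated densities in the images above), for which $z(\theta) = \int_0^1 e^{\theta y}\,dy = (e^\theta - 1)/\theta$ in closed form; the mean and variance computed by direct integration of the density should match finite-difference estimates of $[\log z(\theta)]'$ and $[\log z(\theta)]''$:

```python
from math import exp, log

# Closed-form normalisation factor for h(y) = 1 on [0, 1].
def z(theta):
    return (exp(theta) - 1) / theta

# Mean and variance of the tilted density by midpoint-rule integration.
def moments(theta, n=100000):
    dy = 1.0 / n
    ys = [(i + 0.5) * dy for i in range(n)]
    w = [exp(theta * y) * dy for y in ys]
    total = sum(w)
    m1 = sum(y * wi for y, wi in zip(ys, w)) / total
    m2 = sum(y * y * wi for y, wi in zip(ys, w)) / total
    return m1, m2 - m1 * m1

theta, h = 2.0, 1e-4
mean, var = moments(theta)
b1 = (log(z(theta + h)) - log(z(theta - h))) / (2 * h)                    # [log z]'
b2 = (log(z(theta + h)) - 2 * log(z(theta)) + log(z(theta - h))) / h**2   # [log z]''
print(mean - b1, var - b2)  # both near zero: E[y] = b'(theta), Var[y] = b''(theta)
```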