[Math] Is the expected value of the derivative of the log likelihood evaluated at a larger parameter always negative

probability, statistics

It is well known that the derivative of the log likelihood with respect to the parameter of interest (the score), evaluated at the true parameter value, has zero expected value.

Assuming $f(z;\theta)$ is a probability density function, the quick version of the proof (differentiating the identity $\int f(z;\theta)\,dz = 1$ with respect to $\theta$ and skipping the technicalities in swapping the derivative and the integral) is

$$\int \frac{\partial f(z;\theta)}{\partial \theta}dz=0 \Leftrightarrow \int \frac{f(z;\theta)}{f(z;\theta)} \frac{\partial f(z;\theta)}{\partial \theta}dz=0\Leftrightarrow \int f(z;\theta) \frac{\partial \log f(z;\theta)}{\partial \theta}dz=0$$
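
For instance, for a normal density with known variance $\sigma^2$ the score for the mean is $\frac{\partial}{\partial \mu} \log f(z;\mu) = \frac{z-\mu}{\sigma^2}$, and indeed $E_\mu\left[\frac{Z-\mu}{\sigma^2}\right] = 0$.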

I was wondering whether, taking $\theta_1 > \theta$, the expected value of the score evaluated at $\theta_1$

$$\int f(z;\theta) \frac{\partial \log f(z;\theta_1)}{\partial \theta_1}dz$$

was always negative. I have tried this for the mean and variance of the normal distribution and for the parameter of the exponential distribution and it holds. Here is the derivation for the exponential:

Assume $X \sim \mathrm{Exp}(\lambda)$ and take $\lambda_1 > \lambda$.
The derivative of the log likelihood of the exponential density with one observation is $$S(\lambda, x):=\frac{\partial }{ \partial \lambda} (\log(\lambda) - \lambda x) = \frac{1}{\lambda} - x.$$

So
$$E[S(\lambda_1, X)] = E \left[ \frac{1}{\lambda_1} - X \right] = \frac{1}{\lambda_1} - \frac{1}{\lambda} < 0.$$
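
A quick Monte Carlo sanity check of this value (a sketch in Python/NumPy; the rates, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, lam1 = 1.0, 2.0                    # true rate lambda and a larger lambda_1
x = rng.exponential(scale=1.0 / lam, size=1_000_000)   # draws of X ~ Exp(lambda)

# S(lambda_1, x) = 1/lambda_1 - x, averaged over the sample
print((1.0 / lam1 - x).mean())          # ~ 1/lambda_1 - 1/lambda = -0.5 < 0
```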

Does this hold in general?

EDIT: I was working on a proof just for the exponential family of distributions but couldn't quite make it work; even a special case like the exponential family would be interesting to me.

EDIT2: After thinking about this for a while, I think an equivalent way to state the problem is: "when is the maximum likelihood estimator unbiased?" So, when is it unbiased?

Best Answer

Your question can be rephrased somewhat as 'does the expected value of the derivative of the log-likelihood always point towards the correct value?' (if it doesn't, you can turn it into a counterexample to your hypothesis by flipping the sign of $\theta$ if necessary).

This won't be true in general; you could, for instance, come up with a distribution like:

$$ \sin(x + \theta)^2 / (1+x^2) $$

which is periodic in $\theta$ (and stays periodic after normalizing). Clearly the derivative at $\theta + 2\pi$ must be equal to the one at $\theta$, and clearly $E_\theta[S(\theta_1, X)] = E_{\theta+2\pi}[S(\theta_1, X)]$, so both can't point to the 'correct' value at the same time.
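
Making this explicit: since $f(z;\theta+2\pi) = f(z;\theta)$ for every $z$,

$$E_\theta[S(\theta_1, X)] = \int f(z;\theta) \frac{\partial \log f(z;\theta_1)}{\partial \theta_1}\,\mathrm{d}z = \int f(z;\theta+2\pi) \frac{\partial \log f(z;\theta_1)}{\partial \theta_1}\,\mathrm{d}z = E_{\theta+2\pi}[S(\theta_1, X)],$$

so for $\theta < \theta_1 < \theta + 2\pi$ the hypothesis would require this single number to be negative when the true value is $\theta$ and positive when the true value is $\theta + 2\pi$, which is impossible.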

However, having a probability distribution where several $\theta$ are equivalent is clearly not the usual situation, so we need to require some kind of 'unimodality'. To see what kind we need, it's instructive to pull the derivative outside of the expectation:

$$ \int f(z;\theta) \frac{\partial \log f(z;\theta_1)}{\partial \theta_1} \,\mathrm{d}z = \frac{\partial}{\partial \theta_1} \int f(z;\theta) \log f(z;\theta_1) \,\mathrm{d}z $$

So now we're looking at the (negative) derivative of the cross-entropy, which is also the negative of the derivative of the Kullback-Leibler divergence, a measure of how far the distribution $f(z;\theta_1)$ is from the 'true' distribution $f(z;\theta)$. It's now clear why this derivative usually points the right way: we'd generally expect the model to get better as the parameters get closer to their actual values.
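
Concretely, writing $D_{KL}$ for the Kullback-Leibler divergence,

$$\int f(z;\theta) \log f(z;\theta_1) \,\mathrm{d}z = \int f(z;\theta) \log f(z;\theta) \,\mathrm{d}z - D_{KL}\big(f(\cdot;\theta)\,\big\|\,f(\cdot;\theta_1)\big),$$

and the first term doesn't depend on $\theta_1$, so the derivative we're interested in equals $-\frac{\partial}{\partial \theta_1} D_{KL}\big(f(\cdot;\theta)\,\big\|\,f(\cdot;\theta_1)\big)$.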

Anyway, from this we can extract a sufficient (but maybe not necessary) condition, which is for the probability distribution to be log-concave in the parameter (i.e. $\log f(z;\theta_1)$ is concave w.r.t. $\theta_1$). In that case its expected value

$$ \int f(z;\theta) \log f(z;\theta_1) \,\mathrm{d}z $$

is also concave, which in particular means that its derivative is monotonically non-increasing and is $0$ at $\theta_1 = \theta$; this is enough to conclude that $E_{\theta}[S(\theta_1, X)]$ points towards $\theta$.
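
For the exponential example from the question this expected log-likelihood can be computed explicitly:

$$\int f(z;\lambda) \log f(z;\lambda_1) \,\mathrm{d}z = E_\lambda[\log \lambda_1 - \lambda_1 Z] = \log \lambda_1 - \frac{\lambda_1}{\lambda},$$

which is concave in $\lambda_1$, has derivative $\frac{1}{\lambda_1} - \frac{1}{\lambda}$ (the expected score computed in the question), and is maximised exactly at $\lambda_1 = \lambda$.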

The exponential distribution is log-concave in its rate $\lambda$ and the normal distribution is log-concave in its mean (and both are log-concave in their natural exponential-family parameters), but keep in mind that most distributions are called log-concave when they're log-concave w.r.t. the value (here $z$), not the parameters (here $\theta_1$).
