Understanding Observed vs. Expected Sufficient Statistics in the Exponential Family

exponential-family, maximum-likelihood, self-study, sufficient-statistics

My question arises from reading Minka's "Estimating a Dirichlet Distribution", which states the following without proof in the context of deriving a maximum-likelihood estimator for a Dirichlet distribution based on observations of random vectors:

As always with the exponential family, when the gradient is zero, the expected sufficient statistics are equal to the observed sufficient statistics.

I haven't seen maximum likelihood estimation in the exponential family presented in this way, nor have I found any suitable explanations in my search. Can someone offer insight into the relationship between observed and expected sufficient statistics, and perhaps help me understand maximum likelihood estimation as minimizing their difference?

Best Answer

This is a usual assertion about the exponential family, but in my opinion it is most of the time stated in a way that may confuse the less experienced reader. Taken at face value, it could be interpreted as saying "if our random variable follows a distribution in the exponential family, then if we take a sample and insert it into the sufficient statistic, we will obtain the true expected value of the statistic". If only it were so... Moreover, the statement does not take into account the size of the sample, which may cause further confusion.

The density function of a distribution in the exponential family is

$$f_X(x) = h(x)e^{\eta(\theta) T(x)}e^{-A(\theta)} \tag{1}$$

where $T(x)$ is the sufficient statistic.
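For concreteness, it helps to carry a simple example through the derivation below: the exponential distribution with rate $\theta > 0$ fits the form $(1)$ with

$$f_X(x) = \theta e^{-\theta x}, \quad x>0, \qquad \text{i.e.}\quad h(x)=1,\;\; \eta(\theta) = -\theta,\;\; T(x)=x,\;\; A(\theta)=-\ln\theta,$$

since $h(x)\,e^{\eta(\theta)T(x)}\,e^{-A(\theta)} = e^{-\theta x}\,e^{\ln\theta} = \theta e^{-\theta x}$.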

Since this is a density, it has to integrate to unity, so ($S_x$ is the support of $X$)

$$\int_{S_x} h(x)e^{\eta(\theta) T(x)}e^{-A(\theta)}dx =1 \tag{2}$$

Eq. $(2)$ holds for all $\theta$ so we can differentiate both sides with respect to it:

$$\frac {\partial}{\partial \theta} \int_{S_x} h(x)e^{\eta(\theta) T(x)}e^{-A(\theta)}dx =\frac {\partial (1)}{\partial \theta} =0 \tag{3}$$

Interchanging the order of differentiation and integration, we obtain

$$\int_{S_x} \frac {\partial}{\partial \theta} \left(h(x)e^{\eta(\theta) T(x)}e^{-A(\theta)}\right)dx =0 \tag{4}$$

Carrying out the differentiation we have

$$\frac {\partial}{\partial \theta} \left(h(x)e^{\eta(\theta) T(x)}e^{-A(\theta)}\right) = f_X(x)\big[T(x)\eta'(\theta) - A'(\theta)\big] \tag{5}$$
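As a quick check in the running exponential example, where $\eta'(\theta) = -1$ and $A'(\theta) = -1/\theta$, differentiating the density directly recovers $(5)$:

$$\frac{\partial}{\partial\theta}\left(\theta e^{-\theta x}\right) = e^{-\theta x} - \theta x\, e^{-\theta x} = \theta e^{-\theta x}\left(\tfrac1\theta - x\right) = f_X(x)\big[T(x)\eta'(\theta) - A'(\theta)\big].$$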

Inserting $(5)$ into $(4)$ we get

$$\int_{S_x} f_X(x)\big[T(x)\eta'(\theta) - A'(\theta)\big]dx =0 $$

$$\Rightarrow \eta'(\theta)E[T(X)] - A'(\theta) = 0 \Rightarrow E[T(X)] = \frac {A'(\theta)}{\eta'(\theta)} \tag{6}$$
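In the running example, $(6)$ gives

$$E[T(X)] = \frac{A'(\theta)}{\eta'(\theta)} = \frac{-1/\theta}{-1} = \frac1\theta,$$

which is indeed the mean of an exponential distribution with rate $\theta$.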

Now observe that the left-hand side of $(6)$ is a real number. So the right-hand side must also be a real number, and not a function of $\theta$: it must be evaluated at a specific $\theta$, and that should be the "true" $\theta$, since otherwise the left-hand side would not be the true expected value of $T(X)$. To emphasize this we denote the true value by $\theta_0$, and we re-write $(6)$ as

$$E_{\theta_0}[T(X)] = \frac {A'(\theta)}{\eta'(\theta)}\Big |_{\theta =\theta_0} \tag{6a}$$

We turn now to maximum likelihood estimation. The log-likelihood for a sample of size $n$ is

$$L(\theta \mid \mathbf x) = \sum_{i=1}^n\ln h(x_i) +\eta(\theta)\sum_{i=1}^nT(x_i) -nA(\theta)$$

Setting its derivative with respect to $\theta$ equal to $0$, we obtain the MLE

$$\hat \theta(x) : \frac 1n\sum_{i=1}^nT(x_i) = \frac {A'(\theta)}{\eta'(\theta)}\Big |_{\theta =\hat \theta(x)} \tag {7}$$
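In the running example the log-likelihood is $L(\theta \mid \mathbf x) = -\theta\sum_{i=1}^n x_i + n\ln\theta$, so $(7)$ reads

$$\frac1n\sum_{i=1}^n x_i = \frac1{\hat\theta(x)} \quad\Rightarrow\quad \hat\theta(x) = \frac1{\bar x},$$

the familiar MLE for the rate of an exponential distribution.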

Compare $(7)$ with $(6a)$. The right-hand sides are not equal, since we cannot argue that the MLE has hit upon the true value, so neither are the left-hand sides. But remember that eq. $(2)$ holds for all $\theta$, and so for $\hat \theta$ also. Hence the steps in eqs. $(3)$ through $(6)$ can be taken with respect to $\hat \theta$, and we can write eq. $(6a)$ for $\hat \theta$:

$$E_{\hat\theta(x)}[T(X)] = \frac {A'(\theta)}{\eta'(\theta)}\Big |_{\theta =\hat\theta(x)} \tag{6b}$$

which, combined with $(7)$, leads us to the valid relation

$$ E_{\hat\theta(x)}[T(X)] = \frac 1n\sum_{i=1}^nT(x_i)$$

which is what the assertion under examination really says: the expected value of the sufficient statistic under the MLE of the unknown parameter (in other words, the value of $E[T(X)]$ that we obtain if we use $\hat \theta(x)$ in place of $\theta$) equals, and is not merely approximated by, the average of the sufficient statistic as calculated from the sample $\mathbf x$.

Moreover, only when the sample size is $n=1$ could we accurately say that "the expected value of the sufficient statistic under the MLE equals the sufficient statistic".
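If it helps to see the distinction numerically, here is a minimal sketch using the exponential-distribution example above (standard NumPy only; the parameter values are arbitrary choices for illustration). The expected sufficient statistic under the MLE matches the observed average exactly, whatever the sample, while the expectation under the true parameter only comes close for large $n$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_0 = 2.5                                    # "true" rate parameter
x = rng.exponential(scale=1 / theta_0, size=50)  # sample of size n = 50

theta_hat = 1 / x.mean()                         # MLE of the rate: 1 / sample mean

observed = x.mean()                              # observed statistic (1/n) * sum T(x_i), with T(x) = x
expected_mle = 1 / theta_hat                     # E_{theta_hat}[T(X)] = 1 / theta_hat
expected_true = 1 / theta_0                      # E_{theta_0}[T(X)], the true expected value

print(observed, expected_mle)   # equal (up to floating point), by construction of the MLE
print(expected_true)            # only approximately equal to the observed average
```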