Exponential Family – Why the Exponential Family is Important in Statistics

bayesian, exponential-family, logistic, probability, regression

Why is the exponential family so important in statistics?

I was recently reading about the exponential family within statistics. As far as I understand, the exponential family refers to any probability distribution function that can be written in the following format (notice the "exponent" in this equation):

$$ f_X(x \mid \theta) = h(x)\,\exp\!\big(\eta(\theta)\cdot T(x) - A(\theta)\big) $$
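
For concreteness (this worked example is a standard one, not part of the original screenshot), the Bernoulli distribution that underlies logistic regression can be put in this format:

$$ p(x \mid p) = p^{x}(1-p)^{1-x} = \exp\!\Big(x \log\tfrac{p}{1-p} + \log(1-p)\Big), \qquad x \in \{0,1\}, $$

so its natural parameter is the log-odds $\log\frac{p}{1-p}$ – exactly the logit that appears in logistic regression.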

This includes common probability distributions such as the normal distribution, the gamma distribution, the Poisson distribution, etc. Distributions from the exponential family are often used as the response distribution in regression problems, paired with a "link function" (e.g., in count data settings, the response variable can be related to the covariates through a Poisson distribution with a log link). These distributions are favoured because of their "desirable mathematical properties", which include the following:

1. Exponential families have sufficient statistics that can summarise arbitrary amounts of independent, identically distributed data using a fixed number of values.
2. Exponential families have conjugate priors.
3. The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.
4. In the mean-field approximation used in variational Bayes, the best approximating posterior distribution of an exponential-family node with a conjugate prior is in the same family as that node.

Why are these properties so important?

A) The first property is about "sufficient statistics". A "sufficient statistic" is a statistic that summarises all of the information a data set contains about a given model parameter – no other statistic computed from the same data adds any further information about that parameter.

I am having trouble understanding why this is important. In the case of logistic regression, the logit link function is used to relate the response variable to the observed covariates, and the Bernoulli response distribution belongs to the exponential family. What exactly are the "statistics" in this case (e.g., in a logistic regression model, do these "statistics" refer to the "mean" and "variance" of the beta-coefficients of the regression model)? What are the "fixed values" in this case?

B) Exponential families have conjugate priors.

In the Bayesian setting, a prior p(theta) is called a conjugate prior if the resulting posterior distribution p(theta | x) belongs to the same family of distributions as the prior. If a prior is a conjugate prior, this means that a closed-form solution for the posterior exists and sampling/approximation techniques (e.g., MCMC) are not required to characterise the posterior distribution. Is this correct?

C) Is the third property essentially similar to the second property?

D) I don't understand the fourth property at all. Variational Bayes is an alternative to MCMC sampling techniques that approximates the posterior distribution with a simpler distribution – this can save computational time for high-dimensional posterior distributions with big data. Does the fourth property mean that variational Bayes with conjugate priors in the exponential family has closed-form solutions? So any Bayesian model that uses the exponential family does not require MCMC – is this correct?


Best Answer

Excellent questions.

Regarding A: A sufficient statistic is nothing more than a distillation of the information that is contained in the sample with respect to a given model. As you would expect, if you have a sample $x_i \sim N(\mu,\sigma^2)$ for $i \in \{1, \ldots, N\}$, each independent, it is clear that so long as we calculate the sample mean and sample variance, it doesn't matter what the values of each $x_i$ are. In linear regression (easier to talk about than logistic in this context), the sampling distribution of the unknown coefficient vector (for known variance) is $N\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},\ \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\big)$, so as long as these final quantities are identical, inference based thereupon will be too. This is the idea of sufficiency.

Note that in the $N(\mu,\sigma^2)$ example, the sufficient statistic comprises just two numbers: $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$ and $\frac{1}{N}\sum_{i=1}^N (x_i-\hat{\mu})^2$, no matter how big our sample size $N$ is (assuming $N>2$). Likewise, the vector $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is of dimension $P$ and $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ of dimension $P\times P$ (here $P$ is the number of columns of the design matrix), both of which are independent of $N$ (though, technically, the matrix $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is just a constant under our assumptions). So in these examples, the sufficient statistic has a fixed number of values (not fixed values), or as I would put it, fixed dimension.
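
To see this concretely, here is a minimal numerical sketch (my own illustration, not part of the original answer): two different normal samples that happen to share the same sample mean and variance produce identical log-likelihoods for every candidate $(\mu, \sigma^2)$, so inference based on them is identical.

```python
import numpy as np

# Minimal sketch (not from the original answer): two different normal samples
# that share the same sample mean and variance give identical log-likelihoods
# for any candidate (mu, sigma^2) -- the likelihood only "sees" those two numbers.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)

# Build a second sample with the same sufficient statistics by reflecting each
# point around the sample mean (this leaves the mean and variance unchanged).
y = 2 * x.mean() - x

def normal_loglik(data, mu, sigma2):
    n = data.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((data - mu) ** 2) / (2 * sigma2)

for mu, sigma2 in [(0.0, 1.0), (2.0, 2.25), (-1.0, 0.5)]:
    print(np.isclose(normal_loglik(x, mu, sigma2), normal_loglik(y, mu, sigma2)))  # True each time
```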

Let's note three more things. First, there is no such thing as the sufficient statistic for a distribution; rather, there are many possible statistics which may be sufficient, and which may be of different dimension. Indeed, the second thing to note is that the entire sample itself, since it naturally contains all the information contained in itself, is always a sufficient statistic. This is a trivial case, but an important one, as in general one cannot always expect to find a sufficient statistic of dimension less than $N$. The final thing to note is model specificity: that's why I wrote with respect to a given model above. Changing your likelihood will change the sufficient statistics, at least potentially, for a given dataset.

Regarding B: What you're saying is correct, but in addition to allowing analytic posteriors in the univariate case, conjugacy has serious benefits in the context of Bayesian hierarchical models estimated via MCMC. This is because conditional posteriors are also available in closed form, so we can actually accelerate Metropolis-within-Gibbs style MCMC algorithms with conjugacy.
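
As a toy illustration of what conjugacy buys you (my own sketch, not code from the original answer), here is the textbook Beta–Binomial update, where the posterior is obtained simply by adding observed counts to the prior's parameters – no MCMC needed:

```python
from scipy import stats

# Textbook conjugate update (illustration only): a Beta prior combined with a
# Binomial likelihood yields a Beta posterior in closed form, so the posterior
# can be characterised without any sampling.
prior_a, prior_b = 2.0, 2.0          # Beta(2, 2) prior on a success probability
successes, failures = 27, 13         # hypothetical observed data

post_a = prior_a + successes         # conjugate update: just add the counts
post_b = prior_b + failures
posterior = stats.beta(post_a, post_b)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```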

Regarding C: It's definitely a similar idea, but I do want to make clear that we're talking about two different distributions here: the "posterior" versus the "posterior predictive". As the name implies, both of these are posterior distributions, which means that they are distributions of an unknown variable conditioned on our known data. A "posterior" plain and simple usually refers to something like $P(\mu, \sigma^2 \mid \{x_1, \ldots, x_N\})$ from our normal example above: a distribution over the unknown parameters of the data-generating distribution. In contrast, a "posterior predictive" gives the distribution of a hypothetical $(N+1)$st data point $x_{N+1}$ conditional on the observed data: $P(x_{N+1} \mid \{x_1, \ldots, x_N\})$. Notice that this is not conditional on the parameters $\mu$ and $\sigma^2$: they had to be integrated out. It is this additional integral that conjugacy guarantees is available in closed form.
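
Sticking with the normal example, here is a sketch under stated assumptions (known $\sigma^2$ and a conjugate normal prior on $\mu$; my own illustration, not from the original answer) of how the posterior predictive comes out in closed form once $\mu$ is integrated out:

```python
import numpy as np
from scipy import stats

# Assumptions for this sketch: observation variance sigma^2 is known, and mu
# has a conjugate Normal(mu0, tau0_sq) prior. With conjugacy, integrating mu
# out of P(x_{N+1} | x_1..x_N) leaves another normal distribution.
rng = np.random.default_rng(1)
sigma2 = 4.0                               # known observation variance
mu0, tau0_sq = 0.0, 10.0                   # Normal prior on mu
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)

# Closed-form posterior for mu: Normal(mu_n, tau_n_sq)
n = x.size
tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma2)
mu_n = tau_n_sq * (mu0 / tau0_sq + x.sum() / sigma2)

# Posterior predictive for a new point x_{N+1}: mu integrated out analytically
predictive = stats.norm(loc=mu_n, scale=np.sqrt(sigma2 + tau_n_sq))
print("predictive mean:", predictive.mean(), "predictive sd:", predictive.std())
```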

Regarding D: In the context of Variational Bayes (VB), you have some posterior distribution $P(\theta|X)$ where $\theta$ is some vector of $P$ parameters and $X$ are some data. Rather than trying to generate a sample from it like MCMC, we are instead going to use an approximate posterior distribution that's easy to work with and pretty close to the true one. That's called a variational distribution and is denoted $Q_\eta(\theta)$. Notice that our variational distribution is indexed by variational parameters $\eta$. Variational parameters are nothing like the parameters we do Bayesian inference on, and are nothing like our data. They don't have a distribution associated with them and they don't have some hypothetical role generating the data. Rather, they are chosen as a result of an iterative optimization algorithm. The whole idea of variational inference is to define some measure of dissimilarity between the variational distribution and the true posterior and then minimize that measure with respect to the parameters $\eta$. We'll denote the result of that optimization process by $\hat{\eta}(X)$. At that point, hopefully $Q_{\hat{\eta}(X)}(\theta)$ is pretty close to $P(\theta|X)$, and if we do inferences using $Q_{\hat{\eta}(X)}(\theta)$ instead we'll get similar answers.

Now where does conjugacy fit in? A popular way to measure dissimilarity is the reverse Kullback–Leibler (KL) divergence, which gives the following cost:

$$ \hat{\eta}(X) := \underset{\eta}{\textrm{argmin}}\, \mathbb{E}_{\theta\sim Q_\eta}\bigg[\log \frac{Q_{\eta}(\theta)}{P(\theta\mid X)}\bigg] $$

This integral cannot be solved in terms of simple functions in general. However, it is available in closed form when:

  1. We use a conjugate prior to define $P(\theta|X)$.

  2. We assume that the variational distribution factorises over the parameters (the mean-field assumption), in other words that $Q_\eta(\theta)=\prod_{j=1}^P Q_{j,\eta_j}(\theta_j)$.

  3. We further restrict ourselves to a particular $Q_{j,\eta_j}$ for each $j$ (which is determined by the likelihood).

So it's not that the variational posterior is available in closed form. Rather, it's that the cost function which defines the variational posterior is available in closed form. The cost function being closed form makes computing the variational distribution an easier optimization problem, since we can analytically compute function values and gradients.
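
To make the optimisation view concrete, here is a toy sketch (my own construction, not from the answer): the "posterior" is a correlated bivariate Gaussian, the variational family is a factorised Gaussian, and the reverse KL cost happens to be available in closed form, so a generic optimiser can find $\hat{\eta}(X)$ directly.

```python
import numpy as np
from scipy.optimize import minimize

# Toy sketch: approximate a correlated bivariate Gaussian "posterior" P with a
# factorised (mean-field) Gaussian Q_eta by minimising KL(Q_eta || P).
mu_p = np.array([1.0, -2.0])
Sigma_p = np.array([[2.0, 1.2],
                    [1.2, 1.5]])
Lambda_p = np.linalg.inv(Sigma_p)                  # posterior precision matrix

def reverse_kl(eta):
    """KL(Q_eta || P) for Q_eta = N(m, diag(exp(log_v))); eta = (m, log_v)."""
    m, log_v = eta[:2], eta[2:]
    v = np.exp(log_v)                              # variances kept positive via log
    diff = m - mu_p
    return 0.5 * (np.trace(Lambda_p @ np.diag(v))
                  + diff @ Lambda_p @ diff
                  - np.sum(log_v)
                  + np.log(np.linalg.det(Sigma_p))
                  - 2.0)

eta_hat = minimize(reverse_kl, x0=np.zeros(4)).x   # generic optimiser in place of CAVI
print("variational means:    ", eta_hat[:2])          # close to mu_p
print("variational variances:", np.exp(eta_hat[2:]))  # close to 1 / diag(Lambda_p)
```

In a conjugate exponential-family model the analogous cost and its coordinate-wise optima are available analytically, which is exactly what makes the optimisation tractable at scale.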
