Beta Distribution – Understanding the -1 in Density Function

Tags: beta distribution, beta-binomial distribution, distributions, history, references

The Beta distribution appears under two parametrizations:

$$ f(x) \propto x^{\alpha} (1-x)^{\beta} \tag{1} $$

or the one that seems to be used more commonly

$$ f(x) \propto x^{\alpha-1} (1-x)^{\beta-1} \tag{2} $$

But why exactly is there "$-1$" in the second formula?

The first formulation intuitively seems to correspond more directly to the binomial distribution

$$ g(k) \propto p^k (1-p)^{n-k} \tag{3} $$

but "seen" from the $p$'s perspective. This is especially clear in beta-binomial model where $\alpha$ can be understood as a prior number of successes and $\beta$ is a prior number of failures.

So why exactly did the second form gain popularity, and what is the rationale behind it? What are the consequences of using either parametrization (e.g., for the connection with the binomial distribution)?

It would be great if someone could additionally point to the origins of this choice and the initial arguments for it, but that is not a necessity for me.

Best Answer

This is a story about degrees of freedom and statistical parameters and why it is nice that the two have a direct simple connection.

Historically, the "$-1$" terms appeared in Euler's studies of the Beta function. He was using that parameterization by 1763, and so was Adrien-Marie Legendre: their usage established the subsequent mathematical convention. This work antedates all known statistical applications.

Modern mathematical theory provides ample indications, through the wealth of applications in analysis, number theory, and geometry, that the "$-1$" terms actually have some meaning. I have sketched some of those reasons in comments to the question.

Of more interest is what the "right" statistical parameterization ought to be. That is not quite as clear and it doesn't have to be the same as the mathematical convention. There is a huge web of commonly used, well-known, interrelated families of probability distributions. Thus, the conventions used to name (that is, parameterize) one family typically imply related conventions to name related families. Change one parameterization and you will want to change them all. We might therefore look at these relationships for clues.

Few people would disagree that the most important distribution families derive from the Normal family. Recall that a random variable $X$ is said to be "Normally distributed" when $(X-\mu)/\sigma$ has a probability density $f(x)$ proportional to $\exp(-x^2/2)$. When $\sigma=1$ and $\mu=0$, $X$ is said to have a standard normal distribution.

Many datasets $x_1, x_2, \ldots, x_n$ are studied using relatively simple statistics involving rational combinations of the data and low powers (typically squares). When those data are modeled as random samples from a Normal distribution--so that each $x_i$ is viewed as a realization of a Normal variable $X_i$, all the $X_i$ share a common distribution, and are independent--the distributions of those statistics are determined by that Normal distribution. The ones that arise most often in practice are

  1. $t_\nu$, the Student $t$ distribution with $\nu = n-1$ "degrees of freedom." This is the distribution of the statistic $$t = \frac{\bar X}{\operatorname{se}(X)}$$ where $\bar X = (X_1 + X_2 + \cdots + X_n)/n$ models the mean of the data and $\operatorname{se}(X) = (1/\sqrt{n})\sqrt{(X_1^2+X_2^2 + \cdots + X_n^2 - n\bar X^2)/(n-1)}$ is the standard error of the mean. The division by $n-1$ shows that $n$ must be $2$ or greater, whence $\nu$ is an integer $1$ or greater. The formula, although apparently a little complicated, is the square root of a rational function of the data of degree two: it is relatively simple.

  2. $\chi^2_\nu$, the $\chi^2$ (chi-squared) distribution with $\nu$ "degrees of freedom" (d.f.). This is the distribution of the sum of squares of $\nu$ independent standard Normal variables. The distribution of the mean of the squares of these variables will therefore be a $\chi^2$ distribution scaled by $1/\nu$: I will refer to this as a "normalized" $\chi^2$ distribution.

  3. $F_{\nu_1, \nu_2}$, the $F$ ratio distribution with parameters $(\nu_1, \nu_2)$. This is the distribution of the ratio of two independent normalized $\chi^2$ variables with $\nu_1$ and $\nu_2$ degrees of freedom. (A small simulation sketch of all three statistics follows this list.)

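A small simulation may help make these three constructions concrete. This is only a sketch: the sample size, the split used for the $F$ ratio, and the seed are arbitrary choices of mine, not part of the answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 8, 100_000                      # sample size and number of replications (arbitrary)

x = rng.standard_normal((reps, n))        # each row: a sample X_1, ..., X_n of standard Normals
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                 # sample standard deviation (divisor n - 1)

t = xbar / (s / np.sqrt(n))               # Student t statistic, nu = n - 1 d.f.
chi2 = (x ** 2).sum(axis=1)               # sum of n squared standard Normals: chi-squared, n d.f.
f = ((x[:, :3] ** 2).mean(axis=1)         # ratio of two independent "normalized" chi-squared
     / (x[:, 3:] ** 2).mean(axis=1))      # variables with 3 and n - 3 d.f.: F(3, n - 3)

# Kolmogorov-Smirnov distances to the named distributions (all should be small)
print(stats.kstest(t, stats.t(df=n - 1).cdf).statistic)
print(stats.kstest(chi2, stats.chi2(df=n).cdf).statistic)
print(stats.kstest(f, stats.f(dfn=3, dfd=n - 3).cdf).statistic)
```
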
Mathematical calculations show that all three of these distributions have densities. Importantly, the density of the $\chi^2_\nu$ distribution is proportional to the integrand in Euler's integral definition of the Gamma ($\Gamma$) function. Let's compare them:

$$f_{\chi^2_\nu}(2x) \propto x^{\nu/2 - 1}e^{-x};\quad f_{\Gamma(\nu)}(x) \propto x^{\nu-1}e^{-x}.$$

This shows that half of a $\chi^2_\nu$ variable has a Gamma distribution with parameter $\nu/2$ (equivalently, twice a Gamma$(\nu/2)$ variable has a $\chi^2_\nu$ distribution). The factor of one-half is bothersome enough, but subtracting $1$ would make the relationship much worse. This already supplies a compelling answer to the question: if we want the parameter of a $\chi^2$ distribution to count the number of squared Normal variables that produce it (up to a factor of $1/2$), then the exponent in its density function must be one less than half that count.
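
A quick numerical check of that relationship (just a sketch; $\nu = 5$ and the evaluation grid are arbitrary): a $\chi^2_\nu$ variable is a Gamma$(\nu/2)$ variable with scale $2$, equivalently half of a $\chi^2_\nu$ variable has the unit-scale Gamma$(\nu/2)$ density.

```python
import numpy as np
from scipy import stats

nu = 5                                    # degrees of freedom (arbitrary choice)
x = np.linspace(0.01, 20, 500)

# chi-squared(nu) is Gamma(nu/2) with scale 2 ...
print(np.max(np.abs(stats.chi2(nu).pdf(x) - stats.gamma(nu / 2, scale=2).pdf(x))))  # ~0
# ... equivalently, half of a chi-squared(nu) variable has the unit-scale Gamma(nu/2) density.
print(np.max(np.abs(stats.gamma(nu / 2).pdf(x) - 2 * stats.chi2(nu).pdf(2 * x))))   # ~0
```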

Why is the factor of $1/2$ less troublesome than a difference of $1$? The reason is that the factor remains consistent when we add things up. If the sum of squares of $n$ independent standard Normals is proportional to a Gamma variable whose parameter is $n$ times some fixed factor, then the sum of squares of $m$ independent standard Normals is proportional to a Gamma variable whose parameter is $m$ times that same factor, whence the sum of squares of all $n+m$ variables is proportional to a Gamma variable whose parameter is $m+n$ times that same factor (still with the same proportionality constant). The fact that adding the parameters so closely emulates adding the counts is very helpful.
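
Here is a simulation sketch of that additivity (degrees of freedom, replication count, and seed are arbitrary choices of mine): sums of squares of independent standard Normals combine so that the Gamma shape parameters add while the scale factor stays the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_df, m_df, reps = 4, 7, 100_000          # degrees of freedom and replications (arbitrary)

# Two independent sums of squares of standard Normals: Gamma shapes n/2 and m/2, common scale 2.
a = (rng.standard_normal((reps, n_df)) ** 2).sum(axis=1)
b = (rng.standard_normal((reps, m_df)) ** 2).sum(axis=1)

# Their sum should be Gamma with shape (n + m)/2 and the *same* scale 2, i.e. chi-squared(n + m).
target = stats.gamma((n_df + m_df) / 2, scale=2)
print(stats.kstest(a + b, target.cdf).statistic)   # small: shapes add, the scale is unchanged
```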

If, however, we were to remove that pesky-looking "$-1$" from the mathematical formulas, these nice relationships would become more complicated. For example, if we changed the parameterization of Gamma distributions to refer to the actual power of $x$ in the formula, then a $\chi^2_2$ distribution would be related to a "Gamma$(0)$" distribution (since the power of $x$ in its PDF is $1-1=0$), while the sum of three independent $\chi^2_2$ variables would have to be called a "Gamma$(2)$" distribution (since the power of $x$ in its PDF is $3-1=2$). In short, the close additive relationship between degrees of freedom and the parameter in Gamma distributions would be lost by removing the $-1$ from the formula and absorbing it in the parameter.
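
Spelled out in symbols (my restatement of that example, writing $X_1, X_2, X_3$ for independent $\chi^2_2$ variables, so that each $X_i/2 \sim \Gamma(1)$):

$$\frac{X_1+X_2+X_3}{2} \sim \Gamma(1+1+1) = \Gamma(3).$$

Under the conventional naming the parameters add along with the degrees of freedom; under the shifted naming each summand would be a "Gamma$(0)$" yet their sum would be a "Gamma$(2)$", and the names no longer add.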

Similarly, the density function of an $F$ ratio distribution is closely related to Beta distributions. Indeed, when $Y$ has an $F$ ratio distribution with parameters $(\nu_1, \nu_2)$, the variable $Z=\nu_1 Y/(\nu_1 Y + \nu_2)$ has a Beta$(\nu_1/2, \nu_2/2)$ distribution. Its density function is proportional to

$$f_Z(z) \propto z^{\nu_1/2 - 1}(1-z)^{\nu_2/2-1}.$$
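
A simulation check of this transformation (a sketch only; $\nu_1$, $\nu_2$, the seed, and the replication count are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
nu1, nu2, reps = 3, 9, 100_000            # arbitrary degrees of freedom and replication count

y = stats.f(dfn=nu1, dfd=nu2).rvs(size=reps, random_state=rng)
z = nu1 * y / (nu1 * y + nu2)             # the transformation described above

# Z should follow a Beta(nu1/2, nu2/2) distribution.
print(stats.kstest(z, stats.beta(nu1 / 2, nu2 / 2).cdf).statistic)   # small
```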

Furthermore--taking these ideas full circle--the square of a Student $t$ variable with $\nu$ d.f. has an $F$ ratio distribution with parameters $(1,\nu)$. Once more it is apparent that keeping the conventional parameterization maintains a clear relationship with the underlying counts that contribute to the degrees of freedom.
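
And the corresponding check for the full-circle statement (again a sketch with an arbitrary $\nu$ and seed): squaring Student $t$ draws with $\nu$ d.f. reproduces the $F(1, \nu)$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
nu, reps = 6, 100_000                     # arbitrary degrees of freedom and replication count

t = stats.t(df=nu).rvs(size=reps, random_state=rng)
print(stats.kstest(t ** 2, stats.f(dfn=1, dfd=nu).cdf).statistic)   # small: t_nu squared is F(1, nu)
```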

From a statistical point of view, then, it would be most natural and simplest to use a variation of the conventional mathematical parameterizations of $\Gamma$ and Beta distributions: we should prefer calling a $\Gamma(\alpha)$ distribution a "$\Gamma(2\alpha)$ distribution" and the Beta$(\alpha, \beta)$ distribution ought to be called a "Beta$(2\alpha, 2\beta)$ distribution." In fact, we have already done that: this is precisely why we continue to use the names "Chi-squared" and "$F$ Ratio" distribution instead of "Gamma" and "Beta". Regardless, in no case would we want to remove the "$-1$" terms that appear in the mathematical formulas for their densities. If we did that, we would lose the direct connection between the parameters in the densities and the data counts with which they are associated: we would always be off by one.
