Why does the beta distribution become U-shaped when $\alpha$ and $\beta$ are $<1$?

Tags: beta function, probability

In the Beta distribution (used to model Bernoulli probabilities), the $\alpha$ and $\beta$ parameters can be interpreted as the number of heads $+1$ and the number of tails $+1$ seen. So, if they were both $2$, the distribution would lean towards the coin being fair, with a maximum at $p=0.5$. If they were both $20$, the distribution would be even more confident that we're dealing with a fair coin and would peak even more sharply at $p=0.5$.

What I don't get is its behavior when $\alpha$ and $\beta$ both become $<1$.

In that case, it becomes U-shaped and the density peaks at $p=0$ and $p=1$, meaning the coin is likely heavily biased toward one side or the other (nearly always heads or nearly always tails), we just don't know which. I know there is an intuition for this, since I had an idea about it a long time ago, but I've been trying to recall it all day and can't piece it together. Does anyone have an intuition?
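For concreteness, here is a quick sketch of the densities I have in mind (just an illustration, assuming SciPy is available for `scipy.stats.beta`):

```python
from scipy.stats import beta

# Compare the density at the centre and near an edge for three Betas.
for a, b in [(2, 2), (20, 20), (0.5, 0.5)]:
    centre = beta.pdf(0.5, a, b)   # height at p = 0.5
    edge = beta.pdf(0.05, a, b)    # height near p = 0
    print(f"Beta({a},{b}): pdf(0.5) = {centre:.3f}, pdf(0.05) = {edge:.3f}")
```

For $\alpha=\beta=2$ and $\alpha=\beta=20$ the density is highest at $p=0.5$, but for $\alpha=\beta=0.5$ it is lowest there and rises toward the endpoints.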

Best Answer

As Sample Size Decreases, Variance Increases, which Requires a Bimodal Distribution

Thinking about variance provides one explanation for the U-shaped Beta. As always, a larger sample size (${\displaystyle \alpha + \beta }$) decreases a distribution's variance, and a smaller sample size increases it. If Betas were limited to unimodal distributions, their variance could never reach its full potential. In order to maximize a Beta distribution's variance for a particular mean, the distribution must become bimodal, with its density concentrated at the two extremes. At the limit, as the variance approaches its maximum (for any given mean), the Beta distribution approaches a Bernoulli distribution and its variance likewise approaches the variance of a Bernoulli with its same mean.
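To put a formula behind that claim, the standard variance of a Beta$(\alpha, \beta)$ with mean $\mu = \frac{\alpha}{\alpha+\beta}$ is

$${\rm Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\mu(1-\mu)}{\alpha+\beta+1}.$$

Holding $\mu$ fixed and letting the "sample size" $\alpha+\beta$ shrink toward $0$ drives the variance up toward $\mu(1-\mu)$, which is exactly the variance of a Bernoulli$(\mu)$, the distribution that puts all of its mass at the two extremes.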

Different Interpretations of Alpha and Beta: Mean-Based vs. Mode-Based

The Wikipedia entry for Conjugate Prior (note 3) offers crucial advice about the interpretation of Beta parameters:

"The exact interpretation of the parameters of a beta distribution in terms of number of successes and failures depends on what function is used to extract a point estimate from the distribution. The mean of a beta distribution is ${\displaystyle {\frac {\alpha }{\alpha +\beta }},}$ which corresponds to $\alpha$ successes and $\beta$ failures, while the mode is ${\displaystyle {\frac {\alpha -1}{\alpha +\beta -2}},}$ which corresponds to ${\displaystyle \alpha -1}$ successes and ${\displaystyle \beta -1}$ failures. Bayesians generally prefer to use the posterior mean rather than the posterior mode as a point estimate, justified by a quadratic loss function, and the use of ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ is more convenient mathematically, while the use of ${\displaystyle \alpha -1}$ and ${\displaystyle \beta -1}$ has the advantage that a uniform ${\displaystyle {\rm {Beta}}(1,1)}$ prior corresponds to $0$ successes and $0$ failures."

A similar point is made by Tom Minka in his answer to a related question.

The contrast between these two interpretations becomes especially stark in the case of bimodal Betas, since they have two modes yet only a single mean. Focusing on the example of a fair coin, as this question does, hides the issue because that's the unusual case where the difference between the mean and mode disappears.

In Doing Bayesian Data Analysis, John Kruschke notes that a bimodal Beta would mean we "believe that the coin is a trick coin that nearly always comes up heads or nearly always comes up tails, but we don't know which." (p. 83, 1st ed.) And since that's a rather contrived scenario, it confirms the limitations of the coin tossing example.

Note that if we interpret ${\displaystyle \alpha }$ as successes + 1 and ${\displaystyle \beta }$ as failures + 1, then the success count and failure count must both turn negative when ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ are less than 1. By contrast, if we interpret ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ as successes and failures, respectively, without subtracting 1, then we sidestep the seemingly nonsensical idea of negative counts. Even when ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ are both less than 1, the mean-based interpretation of them poses no issues since the mean remains a single value even when the mode splits off into two.
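A small worked example makes this concrete. For a Beta$(0.5, 0.5)$, the mean is ${\frac{\alpha}{\alpha+\beta}} = 0.5$, a single well-defined value, while the usual mode formula gives ${\frac{\alpha-1}{\alpha+\beta-2}} = \frac{-0.5}{-1} = 0.5$, which now marks the minimum of the U-shaped density rather than a peak, since the density instead piles up at the two endpoints.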

Just as there are some contexts where it makes sense to focus on a distribution's mean and others where it makes sense to focus on its mode(s), so too will our interpretation of ${\displaystyle \alpha }$ and ${\displaystyle \beta }$ depend on which measure of central tendency is of interest in a particular context. As Wikipedia's Conjugate Prior entry puts it:

"It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. For example, the values ${\displaystyle \alpha}$ and ${\displaystyle \beta}$ of a beta distribution can be thought of as corresponding to ${\displaystyle \alpha -1}$ successes and ${\displaystyle \beta -1}$ failures if the posterior mode is used to choose an optimal parameter setting, or ${\displaystyle \alpha}$ successes and ${\displaystyle \beta}$ failures if the posterior mean is used to choose an optimal parameter setting."

You might find this discussion of bimodal Betas helpful too.

The Polya Urn Interpretation Yields Nice Intuitions about U-Shaped Betas

A lesser-known, but surprisingly accessible interpretation of the Beta distribution views it as the result of draws from a Polya urn model. Rather than attempting a full proof here, I will simply explain how this alternative interpretation yields an attractively intuitive explanation of U-shaped Betas.

The basic idea is that an urn initially contains $S$ success balls and $F$ failure balls, corresponding to the ${\displaystyle \alpha}$ and ${\displaystyle \beta}$ parameters (under the mean-based interpretation above). After drawing a single ball from the urn, you not only replace it but also add an additional ball of the same type. In the limit, drawing and adding an infinite number of balls in this way yields a single proportion drawn from a Beta$(S, F)$.

One can see that every successive draw has slightly less impact on the resulting limiting ratio than the draw before it. Starting with a Beta(1,1) means that the urn's ratio will shift from 1/2 to either 1/3 or 2/3 once a third ball gets introduced. With each successive introduction of a new ball, that new ball's influence over subsequent draws shrinks.
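Here is a minimal simulation sketch of that whole-ball urn (an illustration in Python with NumPy; the helper name `polya_urn` is made up for this example). Each run draws a ball with probability proportional to the current counts, adds another ball of the same type, and records the final success fraction; across many runs those fractions should look like draws from a Beta$(S, F)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn(successes, failures, n_draws=2000):
    """One urn run: draw a ball with probability proportional to the
    current counts, replace it, and add another ball of the same type.
    The final success fraction approximates one draw from Beta(S, F)."""
    s, f = float(successes), float(failures)
    for _ in range(n_draws):
        if rng.random() < s / (s + f):  # drew a success ball
            s += 1.0                    # replace it plus one extra success ball
        else:                           # drew a failure ball
            f += 1.0
    return s / (s + f)

# Starting from Beta(1, 1): the limiting fractions should look
# roughly uniform on (0, 1).
fractions = np.array([polya_urn(1, 1) for _ in range(5000)])
print(np.histogram(fractions, bins=10, range=(0, 1))[0])
```

With $S=F=1$ each of the ten bins should hold roughly 500 of the 5,000 runs, matching the flat Beta(1, 1).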

Now, this same Polya urn procedure can be applied with fractional balls if one stipulates (1) that the likelihood of drawing a fractional ball remains proportional to its size--a half ball remains half as likely to be drawn as any whole ball--and (2) that when drawn and replaced, fractional balls are nevertheless accompanied by whole balls of their same type. Fractional balls thereby acquire an influence disproportionate to their actual size.

Consider an example where ${\displaystyle \alpha}$ and ${\displaystyle \beta}$ both begin at $.1$, so that the initial draw provides even odds of drawing either the one-tenth success ball or the one-tenth failure ball. Because fractional balls are accompanied by whole balls when replaced, the first whole ball introduced will dominate all subsequent draws. What started as $.1$ success balls out of $.2$ total balls immediately veers toward a lopsided ratio of either ${\displaystyle {\frac {.1}{1.2}}}$ or ${\displaystyle {\frac {1.1}{1.2}}}$. Indeed, that initial draw so dominates the subsequent draws that the ratio is likely to grow increasingly lopsided over time. Once the ratio tilts decisively away from ${\displaystyle {\frac {.1}{.2}}}$, it is exceedingly unlikely ever to return to anything comparably balanced. And, of course, that effect becomes even more pronounced if one starts with a Beta(.001, .001): the U shape becomes thinner and thinner in the middle and thicker and thicker at the extremes as the sum of ${\displaystyle \alpha}$ and ${\displaystyle \beta}$ gets smaller, because the initial draw more completely dominates the subsequent draws.
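Nothing in the earlier simulation requires whole starting balls, so the same sketch (repeated here so it runs on its own, again purely illustrative) shows what happens when each urn starts with two one-tenth balls: most urns end up nearly all-success or nearly all-failure.

```python
import numpy as np

rng = np.random.default_rng(1)

def limiting_fraction(s, f, n_draws=2000):
    # Fractional starting "balls"; each draw still adds a whole ball
    # of the drawn type, so the first whole ball dominates.
    for _ in range(n_draws):
        if rng.random() < s / (s + f):
            s += 1.0
        else:
            f += 1.0
    return s / (s + f)

fractions = np.array([limiting_fraction(0.1, 0.1) for _ in range(5000)])
# Share of urns whose limiting proportion ends up near 0 or near 1.
print(np.mean((fractions < 0.05) | (fractions > 0.95)))
```

Most of the 5,000 urns should land in those extreme bins, which is the U shape in simulation form; starting closer to Beta(.001, .001) pushes that share even higher.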

While the equivalence between Betas and Polya urns is hardly obvious, the Polya urn offers elegant insights into the Beta distribution.