Bayesian Statistics – How to Obtain Bayes Estimator with Conjugate Prior

bayesianbeta distributionconjugate-priorself-study

Consider n observations $ X_1, X_2,….X_n $ from $ Beta_1 ~ B(1,\theta ) $ distribution.

Obtain Bayes estimator for $ \theta $ under quadratic loss function when conjugate prior is assumed for $\theta $

My Doubts

So here prior distribution is conjugate , it means it will also be from family of beta 1 distribution so $ g(\theta ) $ = $ \frac{\theta^\alpha (1- \theta)^\beta }{\ B(\alpha, \beta )}$

So with the help of hint

$ g(\theta ) $ = $ \alpha \theta^{1-\alpha } $

Now the joint distribution is

$f^*(x,\theta ) $= $f(x|\theta ) \times g(\theta ) $ = $\theta \alpha \theta^{1-\alpha } x^{1-\theta } $

Now after this we will integrate $f^*(x,\theta ) $ to get the marginal distribution of $ x $ = $h(x) $
$h(x) $= $\int_0 ^1 \theta \alpha \theta^{1-\alpha } x^{1-\theta }$

$h(x) $= $\alpha x \int_0 ^1 \theta^{2-\alpha } x^{-\theta }$

Can anyone please tell the integration part , I am unable to figure it out ??

Best Answer

So here prior distribution is conjugate , it means it will also be from family of beta 1 distribution so

The conjugate distribution means that the prior and the posterior are the same distribution. This does not have to have to be the beta distribution.

Bayes theorem

Start with Bayes theorem formulated for continuous functions.

$$f_{posterior}(\theta \vert x) = \frac{f_{likelihood}(x\vert\theta) \cdot f_{prior}(\theta)}{f_{normalization}(x)}$$

likelihood: The probability(density) of the observations $x$ as function of values of $\theta$.

You know that the distribution of the $x_i$ is a $Beta(1,\theta)$ distribution with density function $$f(x_i \vert \theta) = \theta(1-x_i)^{\theta-1}$$ and for the entire sample $$f(x_1, x_2, \dots , x_n \vert \theta) = \theta^n\left( \prod_{i=1}^n (1-x_i) \right)^{\theta-1}= \theta^n(GM^n)^{\theta-1}$$ where $GM = \prod (1-x_i)^{1/n}$ is the geometric mean of the terms $(1-x_i)$
normalization: The prior probability of the observations, $f_{normalization}(x)$.

This is effectively a normalization constant. It is independent of $\theta$ and we can ignore it when we write $$f_{posterior}(\theta \vert x) \propto f_{likelihood}(x\vert\theta) \cdot f_{prior}(\theta)$$ where $\propto$ means 'proportional to'.

In this way, you can see Bayes theorem without having to worry about constants. What we need to know is how the posterior looks like in terms of a function of $\theta$ and we can worry about the constant term (independent of $\theta$) later.

Finding the conjugate distribution

So to look for the conjugate distribution you are looking for a function that remains in the same family after the multiplication with the likelihood

$$f_{posterior}(\theta \vert x) \propto \theta^n(GM^n)^{\theta-1} \cdot f_{prior}(\theta)$$

Now this is not a straightforward technique to get the conjugate distribution, but what I do is imagine a function whose form remains unchanged after multiplication with $\theta^n(GM^n)^{\theta-1}$. I let my mind pass all sorts of forms, polynomials, exponentials, powers... powers?

If we have a power law like $\theta^a$ and multiply with $\theta^n$ then we have again a lower law, but with a different exponent, namely: $$\theta^n\theta^a = \theta^{n+a} = \theta^{a^\prime}$$ where $a^\prime = a + n$.
If we have a power law like $b^{\theta-1}$ and multiply with $(GM^n)^{\theta-1}$ then we have again a power law, but a different base, namely: $$(GM^n)^{\theta-1} \cdot b^{\theta-1} = (GM^n\cdot b)^{\theta-1} = {b^\prime}^{\theta-1}$$ where $b^\prime = GM^n\cdot b$

So, the conjugate distribution needs to be of the form $$f(\theta) \propto \theta^a \cdot b^{\theta-1} = \frac{1}{b} \theta^a \cdot b^{\theta}$$

This $\theta^a \cdot b^{\theta}$ looks familiar, it is the gamma distribution.

Thus, the conjugate distribution is the gamma distribution.

The coefficients of the posterior can be expressed in terms of the coefficients of the prior, by the previously described $a^\prime = a + n$ and $b^\prime = GM^n\cdot b$. I will let you figure out the change of the constant yourselves.

Possibly it is better to use the gamma distribution not parameterized by $a$ and $b$ as above, but instead, redo the above work with $f(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta} \propto \theta^{\alpha-1} e^{-\beta \theta}$

The difference is in $e^{-\beta \theta}$ and $b^{\theta}$ which is the same if you set $e^{-\beta} = b$.

Related Solutions

Solved – Bayes estimator of Bernoulli random variables

You drop the marginal density $p(x)$ (the normalizing constant ) because it is a function of the data which are fixed (in the Bayesian context ) but that leads the posterior density $p(\theta|x)$ to lose some properties like integration to 1 (improper density )over the domain of $\theta$ but this is not a big deal since we are usually not interested in integrating the function but in maximising it , so multiplying this function with the constant does not change the $\theta$ that corresponds to the maximum point (MAP).

Now given a binomial likelihood over $r$ success ($Y$ as you denote it ) in $n$ Bernoulli trials each independent and conditional on unknown success parameter $\theta \in [0,1]$ with prior density $\theta \sim Beta(\alpha,\beta)$ . If you drop the constants in the likelihood and the prior you get the kernel of beta density(the posterior ) that means the posterior is proportional (not equal )to the likelihood * prior such that

$$p(\theta|r,n) \propto {\theta}^r (1-\theta)^{n-r} * {\theta}^{\alpha -1}(1-\theta)^{\beta-1} = {\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1 }$$

Now to make this density proper(new beta density ) we multiply it with the constant $c$ which will ensure that this posterior density integrate to 1 such that :

$$p(\theta|r,n)= c{\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1 }$$ Note no proportionality anymore, now let $c$ be as follows $$c=\frac{\Gamma(n+\alpha+\beta)}{\Gamma(r+\alpha)\Gamma(n-r+\beta)}$$ That means $$ \int_{0}^{1}{\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1 }d\theta=c^{-1}$$ $$\theta|r,n \sim Beta(\alpha+r,\beta +n-r)$$ then $E(\theta|r,n)=\frac{\alpha+r}{\alpha+n+\beta}$ .

Bayesian Estimation – Estimating IID Samples from Uniform[0,?] with Pareto(?,?) Prior for ?

The Pareto is indeed conjugate to the uniform, see e.g. Aside from the exponential family, where else can conjugate priors come from?.

The posterior mean looks right, see also https://en.wikipedia.org/wiki/Pareto_distribution (in Wikipedia's notation, $\alpha>1$ is guaranteed as the present $\alpha$ is positive and the sample size $n\geq1$).

The result that the posterior mean is a weighted average of prior mean and MLE (which the sample mean is not, though, so that I am not sure why to expect that in the first place?) is restricted to certain parametrizations in exponential families (and the uniform is not a member). See e.g. Can the posterior mean always be expressed as a weighted sum of the maximum likelihood estimate and the prior mean? or How does Prior Variance Affect Discrepancy between MLE and Posterior Expectation.

We have that the maximum $X_{(n)}$ is consistent for $\theta$. This follows from, e.g., https://math.stackexchange.com/questions/2905482/expectation-and-variance-of-y-maxx-1-ldots-x-n-where-x-is-uniformly-dis (slightly adapting the argument from a uniform on $[0,1]$ to one on $[0,\theta]$; essentially, work with cdf $y/\theta$ on $[0,\theta]$ instead of cdf $y$ on $[0,1]$) and noting that $E(X_{(n)})\to\theta$ and $V(X_{(n)})\to0$, so mean square convergence which implies consistency.

Also, $(n+\alpha)/(n+\alpha−1)\to1$ the posterior mean will tend to either the true $\theta$ or, when $\beta\geq X_{(n)}$, to $\beta$. [One could additionally consider the variance of the Pareto posterior, which is $\mathcal{O}(n^{-2})$.]

Asymptotically, the latter only seems possible when $\beta$ is larger than the true $\theta$ in view of consistency of $X_{(n)}$ for $\theta$. In that case, the support of the prior does not include the true parameter so that the posterior mean cannot concentrate on the true value.

Here is a little plot with posteriors for different $n$ and one prior choice for $\beta$ smaller (solid) and one larger than the true upper bound of the uniform (vertical black bar). We notice how the posterior concentrates around either the sample maximum or $\beta$.

library(EnvStats)

theta <- 1
beta.low <- 0.8
beta.high <- 1.03
alpha <- 0.5

n <- c(10, 20, 30, 50)
x <- runif(n[4], 0, theta)

x.n <- sapply(n, function(i) max(x[1:i]))

alpha.n <- alpha + n
beta.nlow <- max(x.n, beta.low)
beta.nhigh <- max(x.n, beta.high)

theta.ax <- seq(0.95, 1.1, by=.0001)
plot(theta.ax, dpareto(theta.ax, beta.nlow, alpha.n[4]), type="l", lwd=2, col="deeppink4")
lines(theta.ax, dpareto(theta.ax, beta.nlow, alpha.n[3]), type="l", lwd=2, col="lightblue")
lines(theta.ax, dpareto(theta.ax, beta.nlow, alpha.n[2]), type="l", lwd=2, col="orange")
lines(theta.ax, dpareto(theta.ax, beta.nlow, alpha.n[1]), type="l", lwd=2, col="chartreuse")

lines(theta.ax, dpareto(theta.ax, beta.nhigh, alpha.n[4]), type="l", lwd=2, col="deeppink4", lty=2)
lines(theta.ax, dpareto(theta.ax, beta.nhigh, alpha.n[3]), type="l", lwd=2, col="lightblue", lty=2)
lines(theta.ax, dpareto(theta.ax, beta.nhigh, alpha.n[2]), type="l", lwd=2, col="orange", lty=2)
lines(theta.ax, dpareto(theta.ax, beta.nhigh, alpha.n[1]), type="l", lwd=2, col="chartreuse", lty=2)

abline(v=theta, lwd=4)

Quibbles: The posterior mean is only "the" Bayes estimator when you working with the squared error loss function.

Also, you could omit the lower indicator in the likelihood function since you know that all $X_i$ are nonnegative.

To indicate that the posterior is proportional to some kernel of a distribution, it is more common to use $\propto$ than $\approx$.