Bayesian Statistics – Gentler Approaches for Effective Hypothesis Testing

bayesian, hypothesis-testing

I recently started reading "Introduction to Bayesian Statistics" 2nd Edition by Bolstad. I've had an introductory stats class that covered mainly statistical tests and am almost through a class in regression analysis. What other books can I use to supplement my understanding of this one?

I've made it through the first 100-125 pages fine. After that, the book begins to talk about hypothesis testing, which is what I'm most excited to cover, but there are a couple of things throwing me:

  • The use of probability density functions in calculations. In other words, how to evaluate such equations.
  • This whole sentence: "Suppose we use a beta(1,1) prior for pi. Then given y=8, the posterior density is beta(9,3). The posterior probability of the null hypothesis is…" I believe beta(1,1) refers to a PDF where the mean is 1 and the stdev is 1? I don't get how it would change to a beta(9,3) as a posterior density function.

I do get the concept of priors vs posteriors and understand how to apply them using a table manually. I get (I think!) that pi represents the supposed population proportion or probability.

I don't get how to connect this with the data I would run into on a day-to-day basis and get results.

Best Answer

The use of probability density functions in calculations. In other words, how to evaluate such equations.

I think you're still thinking of this from a frequentist perspective: if you're looking for a point estimate, the posterior won't give it to you. You put PDFs in, you get PDFs out. You can derive point estimates by calculating statistics from your posterior distribution, but I'll get to that in a bit.

I do get the concept of priors vs posteriors and understand how to apply them using a table manually. I get (I think!) that pi represents the supposed population proportion or probability.

$\pi(x)$ is the same thing as $p(x)$: they're both PDFs. $\pi$ is just conventionally used to denote that the particular PDF is a prior density.

I suspect that you don't get priors and posteriors as well as you think you do, so let's back it up to the fundamental underpinning of Bayesian statistics: Subjective Probability.

A Thought Experiment in Subjective Probability

Let's say I present you with a coin and ask you whether or not you think this coin is a fair coin. You've heard a lot of people talk about unfair coins in probability class, but you've never actually seen one in real life, so you respond, "Yeah, sure, I think it's a fair coin." But, the fact that I'm even asking you this question puts you off a little, so although your estimation is that it's fair, you wouldn't really be surprised if it wasn't. Much less surprised than if you found this coin in your pocket change (because you assume that's all real currency, and you don't really trust me right now because I'm acting suspicious).

Now, we run a few experiments. After 100 flips, the coin gives back 53 Heads. You're a lot more confident that it's a fair coin, but you're still open to the possibility that it's not. The difference is that now you would be pretty surprised if this coin turned out to have some sort of bias.

How can we represent your prior and posterior beliefs here, specifically, regarding the probability that the coin will show heads (which we will denote $\theta$)? In a frequentist setting, your prior belief--your null hypothesis--is that $\theta = 0.5$. After running the experiment, you're not able to reject the null, and so you continue on with the assumption that yes, the coin is probably fair. But how do we encapsulate the change in your confidence that the coin is fair? After the experiment you would be willing to bet that the coin is fair; before the experiment, you would have been trepidatious.

In the Bayesian setting, you encapsulate your confidence in propositions by not treating probabilities as scalar values but as random variables, i.e. functions. Instead of saying $\theta = 0.5$ we say $\theta \sim N(0.5, \sigma^2)$, and thereby encapsulate our confidence in the variance of the PDF. If we set a high variance, we're saying, "I think that the probability is 0.5, but I wouldn't be surprised if the probability I actually observe in the world is far away from this value. I think $\theta= 0.5$, but frankly I'm not really that sure." By setting a low variance, we're saying, "Not only do I believe the probability is 0.5, but I would be very surprised if experimentation provides a value that's not very close to $\theta=0.5$." So, in this example, when you start the experiment you have a prior with high variance. After receiving data that corroborates your prior, the mean stays essentially the same, but the variance becomes much smaller: the distribution narrows. Our confidence that $\theta=0.5$ is much higher after running the experiment than before.
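To make that concrete, here's a minimal sketch (mine, not from the book or the original answer) of the coin example: a bell-shaped prior centered at 0.5, updated with 53 heads in 100 flips via a crude grid approximation. The 0.2 prior standard deviation and the grid itself are arbitrary choices for illustration; the point is just that the posterior keeps its center near 0.5 but gets much narrower.

```python
import numpy as np
from scipy import stats

# Grid of possible values for theta (the probability of heads)
theta = np.linspace(0.001, 0.999, 999)

# Prior: "I think it's fair, but I wouldn't be shocked if it weren't."
# The 0.2 standard deviation is an arbitrary choice for illustration.
prior = stats.norm.pdf(theta, loc=0.5, scale=0.2)
prior /= prior.sum()                      # normalize over the grid

# Data: 53 heads in 100 flips
likelihood = stats.binom.pmf(53, n=100, p=theta)

# Bayes' rule on the grid: posterior is proportional to prior * likelihood
posterior = prior * likelihood
posterior /= posterior.sum()

def mean_sd(weights):
    m = np.sum(theta * weights)
    return m, np.sqrt(np.sum((theta - m) ** 2 * weights))

print("prior mean, sd:    ", mean_sd(prior))      # mean near 0.5, sd roughly 0.19
print("posterior mean, sd:", mean_sd(posterior))  # mean near 0.53, sd roughly 0.05
```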

So how do we perform calculations?

We start with PDFs, and we end with PDFs. When you need to report a point estimate, you can calculate statistics like the mean, median or mode of your posterior distribution (depending on your loss function, which I won't get into now. Let's just stick with the mean). If you have a closed form solution for your PDF, it will probably be trivial to determine these values. If the posterior is complicated, you can use procedures like MCMC to sample from your posterior and derive statistics from the sample you drew.
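As a quick illustration (a sketch using scipy, not anything from the book), here are both routes to a point estimate for a posterior that happens to have a closed form, the $Beta(9,3)$ from the passage you quoted; the sampling step stands in for what MCMC output would give you:

```python
import numpy as np
from scipy import stats

posterior = stats.beta(9, 3)   # the posterior from the book's example, discussed below

# Closed-form point estimates straight from the distribution object
print("posterior mean:  ", posterior.mean())    # 9 / (9 + 3) = 0.75
print("posterior median:", posterior.median())

# Sample-based point estimates -- this is the kind of thing you'd do with
# draws coming out of an MCMC sampler when no closed form is available
draws = posterior.rvs(size=100_000, random_state=0)
print("sample mean:     ", draws.mean())
print("sample median:   ", np.median(draws))
```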

In the example where you have a Beta prior and a Binomial likelihood, the calculation of the posterior reduces to a very clean calculation. Given:

  • Prior: $\theta \sim Beta(\alpha, \beta)$
  • Likelihood: $X_i|\theta \sim Bernoulli(\theta)$ for $i = 1, \dots, n$ (equivalently, the total number of successes $y = \sum_{i=1}^n x_i$ follows a $Binomial(n, \theta)$ distribution)

Then the posterior reduces to:

  • Posterior: $\theta|X \sim Beta(\alpha + \sum_{i=1}^n x_i,\, \beta + n - \sum_{i=1}^n x_i)$

This is exactly what's happening in the sentence that threw you: with a $Beta(1,1)$ prior, $y = 8$ successes, and (implicitly) $n = 10$ trials, the posterior is $Beta(1 + 8,\ 1 + 10 - 8) = Beta(9, 3)$.

This will happen any time you have a beta prior and a binomial likelihood, and the reason why should be evident in the calculations provided by DJE. When a particular prior-likelihood pair always yields a posterior from the same family of distributions as the prior, the prior is said to be conjugate to the likelihood. There are many pairs of distributions with conjugate relationships, and conjugacy is very frequently leveraged by Bayesians to simplify calculations. Given a particular likelihood, you can make your life a lot easier by selecting a conjugate prior (if one exists and you can justify your choice of prior).
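Here's how the conjugate update plays out for the book's example, as a short scipy sketch. The 0.5 cutoff for the null hypothesis is my assumption for illustration; use whatever null the book actually specifies.

```python
from scipy import stats

# Book's example: Beta(1, 1) prior, y = 8 successes. n = 10 trials is implied
# by the stated Beta(9, 3) posterior.
alpha_prior, beta_prior = 1, 1
n, y = 10, 8

alpha_post = alpha_prior + y            # 1 + 8 = 9
beta_post = beta_prior + n - y          # 1 + 10 - 8 = 3
posterior = stats.beta(alpha_post, beta_post)   # Beta(9, 3)

# One possible null hypothesis: pi <= 0.5. (The 0.5 cutoff is an assumption
# for illustration -- substitute whatever null the book actually tests.)
print("P(pi <= 0.5 | y) =", posterior.cdf(0.5))   # about 0.03
```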

I believe beta(1,1) refers to a PDF where the mean is 1 and the stdev is 1?

In the common parameterization of the normal distribution, the two parameters signify the mean and standard deviation of the distribution. But that's just how we parameterize the normal distribution. Other probability distributions are parameterized very differently.

The Beta distribution is usually parameterized as $Beta(\alpha, \beta)$ where $\alpha$ and $\beta$ are called "shape" parameters. The Beta distribution is extremely flexible and takes lots of different forms depending on how these parameters are set. To illustrate how different this parameterization is from your original assumption, here's how you calculate the mean and variance for Beta random variables:

\begin{equation}
\begin{split}
X &\sim Beta(\alpha, \beta) \\
\operatorname{E}[X] &= \frac{\alpha}{\alpha + \beta} \\
\operatorname{var}[X] &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\end{split}
\end{equation}

As you can clearly see, the mean and variance are not a part of the parameterization of this distribution, but they have closed form solutions that are simple functions of the input parameters.
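If you want to sanity-check those formulas numerically, a quick scipy sketch (using the $Beta(9,3)$ posterior from above as an example) does it:

```python
from scipy import stats

a, b = 9, 3   # the posterior from the example above

# scipy agrees with the closed-form expressions for the mean and variance
print(stats.beta(a, b).mean(), a / (a + b))                            # both 0.75
print(stats.beta(a, b).var(), a * b / ((a + b)**2 * (a + b + 1)))      # both ~0.0144
```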

I won't go into detail describing the differences in parameterizations of other well-known distributions, but I recommend you look up a few. Any basic text, even Wikipedia, will describe how changing the parameters modifies the distribution. You should also read up on the relationships between the different distributions (for instance, $Beta(1,1)$ is the same thing as $Uniform(0,1)$).
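And a quick check (again just a sketch with scipy) of that last fact:

```python
from scipy import stats

# Beta(1, 1) and Uniform(0, 1) have the same density: constant 1 on [0, 1]
for x in (0.1, 0.5, 0.9):
    print(stats.beta(1, 1).pdf(x), stats.uniform(0, 1).pdf(x))   # both 1.0
```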
