Solved – What’s the difference between prior and marginal probabilities

bayesian, marginal-distribution, prior, probability

Let's say I have a distribution for a random variable S:

s | P(S=s)
--+-------
0 | .28
1 | .72

That's a prior, right? It represents our belief about the
likelihood of an event happening, absent other information. It is fundamentally different from something like P(S=s|R=r), which represents our belief about S given the information that R=r.

Alternatively, I could be given a joint distribution for S and R and compute the marginal probabilities:

s | r | P(S=s, R=r)
--+---+------------
0 | 0 | 0.2
0 | 1 | 0.08
1 | 0 | 0.7
1 | 1 | 0.02

So the marginals are:

s | P(S=s)    r | P(R=r)
--+-------    --+-------
0 | 0.28      0 | 0.90
1 | 0.72      1 | 0.10
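
For reference, here is a minimal sketch (in Python, with the joint table hard-coded as a dictionary; that representation is just an illustrative choice) of how these marginals fall out of summing the joint over the other variable:

```python
# Joint distribution P(S=s, R=r) from the table above.
joint = {
    (0, 0): 0.20, (0, 1): 0.08,
    (1, 0): 0.70, (1, 1): 0.02,
}

# Marginalize: P(S=s) = sum over r of P(S=s, R=r), and similarly for R.
p_s = {s: sum(p for (si, _), p in joint.items() if si == s) for s in (0, 1)}
p_r = {r: sum(p for (_, ri), p in joint.items() if ri == r) for r in (0, 1)}

print({s: round(p, 2) for s, p in p_s.items()})  # {0: 0.28, 1: 0.72}
print({r: round(p, 2) for r, p in p_r.items()})  # {0: 0.9, 1: 0.1}
```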

My question is: how is this marginal distribution for S any different than the prior for S? Is it only a matter of interpretation?

Best Answer

$P(S=s)$ and $P(R=r)$ are both marginal probabilities, obtained from the following table

$$ \begin{array}{c|cc|c} & R=0 & R=1 \\ \hline S=0 & 0.20 & 0.08 & 0.28 \\ S=1 & 0.70 & 0.02 & 0.72 \\ \hline & 0.90 & 0.10 & \end{array} $$

Given such a table, you can calculate the conditional probabilities $P(S \mid R)$ or $P(R \mid S)$ by applying Bayes' theorem, e.g.

$$ P(S \mid R) = \frac{P(R \mid S) \, P(S)}{P(R)} = \frac{P(R \cap S)}{P(R)} $$

and in the same way you could calculate $P(R \mid S)$. Notice that to apply it you need to know either the conditional or the joint probabilities. This is a basic application of Bayes' theorem and it has many nice uses.
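
To make this concrete, here is a minimal sketch (Python, reusing the joint table from the question; the values are of course illustrative) that computes $P(S \mid R)$ directly as $P(R \cap S) / P(R)$:

```python
# Joint distribution P(S=s, R=r) from the question's table.
joint = {(0, 0): 0.20, (0, 1): 0.08, (1, 0): 0.70, (1, 1): 0.02}

def p_r(r):
    """Marginal P(R=r), obtained by summing the joint over s."""
    return sum(p for (_, ri), p in joint.items() if ri == r)

def p_s_given_r(s, r):
    """Conditional P(S=s | R=r) = P(S=s, R=r) / P(R=r)."""
    return joint[(s, r)] / p_r(r)

print(p_s_given_r(1, 0))  # 0.70 / 0.90 ≈ 0.778
print(p_s_given_r(1, 1))  # 0.02 / 0.10 ≈ 0.2
```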

Now, an important thing to notice: applying Bayes' theorem is not the same as doing Bayesian statistics. $P(S)$ in your example is no more a prior than $P(R)$ is. Moreover, to calculate the "posterior" probability you need to know the joint or the conditional probabilities. If you are thinking of some simple example like "there is a 0.7 probability that Jack has stolen an orange from the shop", you cannot apply Bayes' theorem to that problem just by assuming that, in your opinion, the probability is, say, 0.3, unless you also know the joint probabilities (the probability that he is guilty and you assume he is, etc.) or the conditional probabilities (the probability that you assume he is guilty given the fact that he is guilty). This is not how we use priors in statistics.

When applying Bayes' theorem in statistics, we have some data $X$ that can be described by a probability density function $f_\theta$, but we do not know the value of its parameter $\theta$. To estimate $\theta$ we can use many different statistical approaches; for example, maximum likelihood estimation maximizes the likelihood function

$$ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{\theta} f_\theta( X ) $$
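
For instance, here is a minimal sketch of maximum likelihood estimation by numerical maximization (Python; the Bernoulli model and the ten observations are an assumed illustration, not part of the question):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: ten Bernoulli(theta) observations (assumed for this example).
X = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(theta):
    """-log f_theta(X) for an i.i.d. Bernoulli sample."""
    return -np.sum(X * np.log(theta) + (1 - X) * np.log(1 - theta))

# argmax_theta f_theta(X) is the same as argmin of the negative log-likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # ≈ 0.7, matching the closed-form Bernoulli MLE, X.mean()
```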

Another approach is to include some prior information in the process of estimating the parameter, i.e. to use a Bayesian approach. This is done by using Bayes' theorem, but in a different way. First, we choose some probability distribution for $\theta$, call it $g$, and assume a priori that the unknown parameter follows this distribution. We then use Bayes' theorem to combine the two sources of information: our a priori assumptions about $\theta$, that is, our prior $g$; and the information contained in the data, that is, the likelihood function $f_\theta(X)$, so as to obtain the posterior $g(\theta \mid X)$:

$$ g(\theta | X) \propto f_\theta(X) \, g(\theta) $$
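
Here is a minimal sketch of this update on a grid of $\theta$ values (Python, continuing the assumed Bernoulli example above; the Beta(2, 2) prior is just an illustrative choice for $g$):

```python
import numpy as np
from scipy.stats import beta, binom

# Assumed example data: 7 successes in 10 Bernoulli trials.
successes, trials = 7, 10

theta = np.linspace(0.001, 0.999, 999)            # grid of candidate parameter values
prior = beta.pdf(theta, 2, 2)                     # g(theta): an assumed Beta(2, 2) prior
likelihood = binom.pmf(successes, trials, theta)  # f_theta(X) as a function of theta

posterior = likelihood * prior                    # g(theta | X) is proportional to f_theta(X) g(theta)
posterior /= posterior.sum()                      # normalize over the grid

print(theta[np.argmax(posterior)])  # posterior mode ≈ 0.667, between the prior mean 0.5 and the MLE 0.7
```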

If this still sounds complicated, you can start by going through other questions tagged bayesian for lots of examples. There are also many good introductory books to start with, e.g. Bayesian Data Analysis by Andrew Gelman et al., or Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan by John K. Kruschke.
