Bayesian vs Frequentist Debate: Mathematical Foundations and Differences

bayesian, frequentist, kolmogorov-axioms, philosophical, probability

It says on Wikipedia that:

the mathematics [of probability] is largely independent of any interpretation of probability.

Question: Then if we want to be mathematically correct, shouldn't we disallow any interpretation of probability? That is, are both Bayesianism and frequentism mathematically incorrect?

I don't like philosophy, but I do like math, and I want to work exclusively within the framework of Kolmogorov's axioms. If this is my goal, should it follow from what it says on Wikipedia that I should reject both Bayesianism and frequentism? If the concepts are purely philosophical and not at all mathematical, then why do they appear in statistics in the first place?

Background/Context:
This blog post doesn't quite say the same thing, but it does argue that attempting to classify techniques as "Bayesian" or "frequentist" is counter-productive from a pragmatic perspective.

If the quote from Wikipedia is true, then it seems that classifying statistical methods is also counter-productive from a philosophical perspective: if a method is mathematically correct, it is valid to use it whenever the assumptions of the underlying mathematics hold; if it is not mathematically correct, or if the assumptions do not hold, then it is invalid to use it.

On the other hand, a lot of people seem to identify "Bayesian inference" with probability theory (i.e. Kolmogorov's axioms), although I'm not quite sure why. Some examples are Jaynes's treatise on Bayesian inference, "Probability Theory: The Logic of Science", as well as James Stone's book "Bayes' Rule". So if I took these claims at face value, that means I should prefer Bayesianism.

However, Casella and Berger's book seems frequentist, because it discusses maximum likelihood estimators but ignores maximum a posteriori estimators; yet everything in it also seems mathematically correct.

So then wouldn't it follow that the only mathematically correct version of statistics is that which refuses to be anything but entirely agnostic with respect to Bayesianism and frequentism? If methods with both classifications are mathematically correct, then isn't it improper practice to prefer some over the others, because that would be prioritizing vague, ill-defined philosophy over precise, well-defined mathematics?

Summary: In short, I don't understand what the mathematical basis is for the Bayesian versus frequentist debate, and if there is no mathematical basis for the debate (which is what Wikipedia claims), I don't understand why it is tolerated at all in academic discourse.

Best Answer

Probability spaces and Kolmogorov's axioms

A probability space $\mathcal{P}$ is by definition a triple $(\Omega, \mathcal{F}, \mathbb{P})$ where $\Omega$ is a set of outcomes, $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$, and $\mathbb{P}$ is a probability measure that fulfills Kolmogorov's axioms, i.e. $\mathbb{P}$ is a function from $\mathcal{F}$ to $[0,1]$ such that $\mathbb{P}(\Omega)=1$ and, for pairwise disjoint $E_1, E_2, \dots$ in $\mathcal{F}$, it holds that $\mathbb{P}\left( \bigcup_{j=1}^\infty E_j \right)=\sum_{j=1}^\infty \mathbb{P}(E_j)$.

Within such a probability space one can, for two events $E_1, E_2$ in $\mathcal{F}$ with $\mathbb{P}(E_2)>0$, define the conditional probability as $\mathbb{P}(E_1 \mid E_2)\stackrel{def}{=}\frac{\mathbb{P}(E_1 \cap E_2)}{\mathbb{P}(E_2)}$
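As a concrete illustration (anticipating the die example below, with a fair six-sided die): taking $E_1 = \{2\}$ and $E_2 = \{2,4,6\}$ gives

$$\mathbb{P}(E_1 \mid E_2) = \frac{\mathbb{P}(E_1 \cap E_2)}{\mathbb{P}(E_2)} = \frac{1/6}{1/2} = \frac{1}{3}.$$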

Note that:

  1. This ''conditional probability'' is only defined when $\mathbb{P}$ is defined on $\mathcal{F}$ (and when $\mathbb{P}(E_2)>0$), so we need a probability space in order to define conditional probabilities.
  2. A probability space is defined in very general terms (a set $\Omega$, a $\sigma$-algebra $\mathcal{F}$ and a probability measure $\mathbb{P}$); the only requirement is that certain properties, the axioms, should be fulfilled, but apart from that these three elements can be ''anything''.

More detail can be found in this link.

Bayes' rule holds in any (valid) probability space

From the definition of conditional probability it also holds that $\mathbb{P}(E_2 \mid E_1)=\frac{\mathbb{P}(E_2 \cap E_1)}{\mathbb{P}(E_1)}$ (for $\mathbb{P}(E_1)>0$). From these two equations we find Bayes' rule: derive $\mathbb{P}(E_1 \cap E_2)$ and $\mathbb{P}(E_2 \cap E_1)$ from each equation and equate them (they are equal because intersection is commutative). So Bayes' rule holds, by the definition of conditional probability, in any probability space.
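Written out, with $\mathbb{P}(E_1), \mathbb{P}(E_2) > 0$:

$$\mathbb{P}(E_1 \mid E_2)\,\mathbb{P}(E_2) = \mathbb{P}(E_1 \cap E_2) = \mathbb{P}(E_2 \cap E_1) = \mathbb{P}(E_2 \mid E_1)\,\mathbb{P}(E_1),$$

and dividing by $\mathbb{P}(E_2)$ gives Bayes' rule:

$$\mathbb{P}(E_1 \mid E_2) = \frac{\mathbb{P}(E_2 \mid E_1)\,\mathbb{P}(E_1)}{\mathbb{P}(E_2)}.$$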

As Bayes' rule is the basis for Bayesian inference, one can do Bayesian analysis in any valid probability space (i.e. one fulfilling all the conditions, among them Kolmogorov's axioms).
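As a minimal sketch of such an analysis (assuming scipy is available, a uniform Beta(1,1) prior, and made-up coin-toss data), one Bayes-rule update looks like this:

```python
import numpy as np
from scipy import stats

# Made-up coin tosses for illustration: 1 = head, 0 = tail.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

# Uniform Beta(1, 1) prior on the head probability p; by conjugacy,
# Bayes' rule turns it into a Beta(1 + #heads, 1 + #tails) posterior.
heads = int(data.sum())
tails = len(data) - heads
posterior = stats.beta(1 + heads, 1 + tails)

print(f"posterior mean        = {posterior.mean():.3f}")
print(f"95% credible interval = {posterior.interval(0.95)}")
```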

Frequentist definition of probability is a ''special case''

The above holds ''in general'': we have no specific $\Omega$, $\mathcal{F}$, $\mathbb{P}$ in mind, as long as $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$ and $\mathbb{P}$ fulfills Kolmogorov's axioms.

We will now show that a ''frequentist'' definition of $\mathbb{P}$ fulfills Kolmogorov's axioms. If that is the case, then ''frequentist'' probabilities are just a special case of Kolmogorov's general and abstract probability.

Let's take an example: rolling a die. The set of all possible outcomes is $\Omega=\{1,2,3,4,5,6\}$. We also need a $\sigma$-algebra on this set $\Omega$; we take $\mathcal{F}$ to be the set of all subsets of $\Omega$, i.e. $\mathcal{F}=2^\Omega$.

We still have to define the probability measure $\mathbb{P}$ in a frequentist way. To that end we define $\mathbb{P}(\{1\})$ as $\mathbb{P}(\{1\}) \stackrel{def}{=} \lim_{n \to +\infty} \frac{n_1}{n}$, where $n_1$ is the number of $1$'s obtained in $n$ rolls of the die. Similarly for $\mathbb{P}(\{2\}), \dots, \mathbb{P}(\{6\})$.
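As a quick numerical illustration (a sketch, assuming numpy and simulating a fair die with a pseudo-random generator), the relative frequency $n_1/n$ settles near $1/6$:

```python
import numpy as np

# Estimate P({1}) by the relative frequency n_1 / n for growing n.
rng = np.random.default_rng(0)
for n in (10**2, 10**4, 10**6):
    rolls = rng.integers(1, 7, size=n)  # n rolls of a fair six-sided die
    print(f"n = {n:>7}: n_1/n = {np.mean(rolls == 1):.4f}")
# The relative frequency approaches 1/6 ≈ 0.1667 as n grows.
```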

In this way $\mathbb{P}$ is defined for all singletons in $\mathcal{F}$. For any other set in $\mathcal{F}$, e.g. $\{1,2\}$, we define $\mathbb{P}(\{1,2\})$ in a frequentist way, i.e. $\mathbb{P}(\{1,2\}) \stackrel{def}{=} \lim_{n \to +\infty} \frac{n_1+n_2}{n}$; by the additivity of limits this is equal to $\mathbb{P}(\{1\})+\mathbb{P}(\{2\})$, which implies that Kolmogorov's axioms hold.
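In fact additivity already holds at every finite $n$, because counts of disjoint events add exactly; a small check (a sketch, again assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 rolls of a fair die

f1 = np.mean(rolls == 1)                    # relative frequency of {1}
f2 = np.mean(rolls == 2)                    # relative frequency of {2}
f12 = np.mean((rolls == 1) | (rolls == 2))  # relative frequency of {1,2}

# Counts of disjoint events add exactly (n_{12} = n_1 + n_2),
# so the frequencies agree at every n, hence also in the limit.
print(f"f({{1,2}}) = {f12}, f({{1}}) + f({{2}}) = {f1 + f2}")
```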

So the frequentist definition of probability is just a special case of Kolmogorov's general and abstract definition of a probability measure.

Note that there are other ways to define a probability measure that fulfills Kolmogorov's axioms, so the frequentist definition is not the only possible one.

Conclusion

The probability in Kolmogorov's axiomatic system is ''abstract'': it has no real meaning; it only has to fulfill the conditions called ''axioms''. Using only these axioms, Kolmogorov was able to derive a very rich set of theorems.

The frequentist definition of probability fulfills the axioms. Therefore, if we replace the abstract, ''meaningless'' $\mathbb{P}$ by a probability defined in a frequentist way, all these theorems remain valid, because the ''frequentist probability'' is just a special case of Kolmogorov's abstract probability (i.e. it fulfills the axioms).

One of the properties that can be derived in Kolmogorov's general framework is Bayes' rule. As it holds in the general and abstract framework, it also holds (see above) in the specific case where the probabilities are defined in a frequentist way (because the frequentist definition fulfills the axioms, and these axioms were the only thing needed to derive all the theorems). So one can do Bayesian analysis with a frequentist definition of probability.

Defining $\mathbb{P}$ in a frequentist way is not the only possibility: there are other ways to define it such that it fulfills Kolmogorov's abstract axioms. Bayes' rule will also hold in these ''specific cases''. So one can also do Bayesian analysis with a non-frequentist definition of probability.

EDIT 23/8/2016

@mpiktas, in reaction to your comment:

As I said, the sets $\Omega, \mathcal{F}$ and the probability measure $\mathbb{P}$ have no particular meaning in the axiomatic system, they are abstract.

In order to apply this theory you have to give further definitions (so what you say in your comment, ''no need to muddle it further with some bizarre definitions'', is wrong: you do need additional definitions).

Let's apply it to the case of tossing a fair coin. The set $\Omega$ in Kolmogorov's theory has no particular meaning; it just has to be ''a set''. So we must specify what this set is in the case of the fair coin, i.e. we must define the set $\Omega$. If we represent heads as H and tails as T, then the set $\Omega$ is by definition $\Omega\stackrel{def}{=}\{H,T\}$.

We also have to define the events, i.e. the $\sigma$-algebra $\mathcal{F}$. We define it as $\mathcal{F} \stackrel{def}{=} \{\emptyset, \{H\},\{T\},\{H,T\} \}$. It is easy to verify that $\mathcal{F}$ is a $\sigma$-algebra.

Next we must define, for every event $E \in \mathcal{F}$, its measure. So we need to define a map from $\mathcal{F}$ to $[0,1]$. I will define it in the frequentist way: for a fair coin, if I toss it a huge number of times, the fraction of heads will tend to 0.5, so I define $\mathbb{P}(\{H\})\stackrel{def}{=}0.5$. Similarly I define $\mathbb{P}(\{T\})\stackrel{def}{=}0.5$, $\mathbb{P}(\{H,T\})\stackrel{def}{=}1$ and $\mathbb{P}(\emptyset)\stackrel{def}{=}0$. Note that $\mathbb{P}$ is a map from $\mathcal{F}$ to $[0,1]$ and that it fulfills Kolmogorov's axioms.
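To make this concrete, here is a toy encoding of this coin-toss space (a sketch in Python; the names `Omega`, `F`, and `P` are just illustrative) that checks the axioms by brute force:

```python
from itertools import combinations

Omega = frozenset({"H", "T"})

# F = 2^Omega: all subsets of Omega (a sigma-algebra on a finite set).
def power_set(s):
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

F = power_set(Omega)

# The measure P, defined event by event as in the text.
P = {frozenset(): 0.0,
     frozenset({"H"}): 0.5,
     frozenset({"T"}): 0.5,
     frozenset({"H", "T"}): 1.0}

# Check Kolmogorov's axioms. (On a finite space, countable
# additivity reduces to finite additivity on disjoint events.)
assert P[Omega] == 1.0                     # P(Omega) = 1
assert all(0.0 <= P[E] <= 1.0 for E in F)  # P maps F into [0, 1]
for E1 in F:
    for E2 in F:
        if not (E1 & E2):                  # E1 and E2 disjoint
            assert P[E1 | E2] == P[E1] + P[E2]
print("All axioms verified for the coin-toss space.")
```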

For references on the frequentist definition of probability, see this link (at the end of the section 'definition') and this link.