[Math] Do Kolmogorov's axioms permit speaking of frequencies of occurrence in any meaningful sense?

foundations, logic, philosophy, probability, probability-theory

It is frequently stated (in textbooks, on Wikipedia) that the "Law of large numbers" in mathematical probability theory is a statement about relative frequencies of occurrence of an event in a finite number of trials, or that it "relates the axiomatic concept of probability to the statistical concept of frequency". Isn't this a methodological mistake of ascribing to a mathematical term an interpretation that does not at all follow from how that term is mathematically defined, perhaps by relying too much on the colorful language? Recall the typical derivation of the WLLN:

Let $X_1, X_2, \ldots, X_n$ be a sequence of $n$ independent and identically distributed random variables with finite mean $\mu$ and variance $\sigma^2$, and let:

$\overline{X}=\tfrac1n(X_1+\cdots+X_n)$

We have:

$E[\overline{X}] = \frac{E[X_1+\cdots+X_n]}{n} = \frac{E[X_1]+\cdots+E[X_n]}{n} = \frac{n\mu}{n} = \mu$

$Var[\overline{X}] = \frac{Var[X_1+\cdots+X_n]}{n^2} = \frac{Var[X_1]+\cdots+Var[X_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$

And from Chebyshev's inequality:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

And so $\overline{X}$ is said to converge in probability to $\mu$ as $n \to \infty$.
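As a concrete instance (the numbers here are my own illustration, not part of any standard derivation): for a coin modelled with $P(X_i = 1) = P(X_i = 0) = \tfrac12$ we have $\mu = \tfrac12$ and $\sigma^2 = \tfrac14$, so

$P(|\overline{X}-\tfrac12|>0.01) \le \frac{1/4}{n \cdot 0.01^2} = \frac{2500}{n}$

which is at most $0.025$ for $n = 100{,}000$. Note that so far this is still a statement purely about the number $P()$.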

Now consider what, strictly speaking, is the meaning of this expression in the axiomatic framework in which it is derived:

$P(|\overline{X}-\mu|>\epsilon) \le \frac{\sigma^2}{n\epsilon^2}$

$P()$, everywhere it occurs in the derivation, is known only to be a number satisfying Kolmogorov's axioms: a number between 0 and 1, and so forth. None of the axioms introduce any theoretical equivalent of the intuitive notion of frequency. If no additional assumptions about $P()$ are made, the sentence obviously cannot be interpreted at all; and, just as importantly, the theoretical mean $\mu$ is not necessarily the mean value in an infinite number of trials, $\overline{X}$ is not necessarily the mean value from $n$ trials, and so forth. Consider an experiment of tossing a fair coin repeatedly: quite obviously, nothing in Kolmogorov's axioms enforces using $1/2$ for the probability of heads; you could just as well use $1/\sqrt{\pi}$, and the derivation continues to "work", except that the meaning of the various variables is no longer in agreement with their intuitive interpretations. The $P()$ might still mean something; it might be a quantification of an absurd belief of mine. The mathematical derivation remains true regardless, in the sense that as long as the initial $P()$'s satisfy the axioms, theorems about other $P()$'s follow. With Kolmogorov's axioms providing only weak constraints on, and not a definition of, $P()$, it is basically only symbol manipulation.
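To make the $1/\sqrt{\pi}$ point concrete, here is a minimal simulation sketch (the sample size, seed and tolerance are arbitrary choices of mine, purely for illustration): a physically fair coin is tossed, so the observed relative frequency of heads settles near $1/2$, while the "model" is deliberately given $\mu = 1/\sqrt{\pi} \approx 0.564$. Every inequality in the derivation remains a true statement about the model's own $P()$; it simply has nothing to do with what the coin does.

```python
import math
import random

random.seed(0)

n = 100_000
tosses = [random.random() < 0.5 for _ in range(n)]   # a physically "fair" coin
freq = sum(tosses) / n                               # observed relative frequency of heads

mu_model = 1 / math.sqrt(math.pi)                    # axiomatically admissible, but absurd, P(heads)
sigma2_model = mu_model * (1 - mu_model)             # variance of a Bernoulli(mu_model) variable
eps = 0.01

# The Chebyshev-type bound derived *inside* the model:
# P(|X_bar - mu_model| > eps) <= sigma2_model / (n * eps^2).
# It holds for the model's P() no matter what the physical coin actually does.
bound = sigma2_model / (n * eps**2)

print(f"observed relative frequency of heads : {freq:.4f}")
print(f"model mean mu = 1/sqrt(pi)           : {mu_model:.4f}")
print(f"model's Chebyshev bound for eps=0.01 : {bound:.4f}")
```

The observed frequency comes out near $0.5$, i.e. much further than $\epsilon$ from the model's $\mu \approx 0.564$, yet nothing in the axioms or in the derivation is violated.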

This "relative frequency" interpretation frequently given seems to rest on an additional assumption, and this assumption seems to be a form of the law of large numbers itself. Consider this fragment from Kolmogorov's Grundbegriffe on applying the results of probability theory to the real world:

We apply the theory of probability to the actual world of experiment
in the following manner:

4) Under certain conditions, which we shall not discuss here, we may
assume that the event A which may or may not occur under conditions S,
is assigned a real number P(A) which has the following
characteristics:

a) One can be practically certain that if the complex of conditions S
is repeated a large number of times, n, then if m be the number of
occurrences of event A, the ratio m/n will differ very slightly from
P(A).

This seems equivalent to introducing the weak law of large numbers, in a particular and slightly different form, as an additional axiom.

Meanwhile, many reputable sources contain statements that seem completely in opposition to the above reasoning, for example Wikipedia:

It follows from the law of large numbers that the empirical
probability of success in a series of Bernoulli trials will converge
to the theoretical probability. For a Bernoulli random variable, the
expected value is the theoretical probability of success, and the
average of n such variables (assuming they are independent and
identically distributed (i.i.d.)) is precisely the relative frequency.

This seems to be mistaken already in claiming that anything about empirical probability (which that page defines as the relative frequency in an actual experiment) can follow from a mathematical theorem, but there are subtler claims that also seem technically erroneous in light of the above considerations:

The LLN is important because it "guarantees" stable long-term results for the averages of random events.

Note that the Wikipedia article about the LLN claims to be about the mathematical theorem, not about the empirical observation, which historically was also sometimes called the LLN. It seems to me that the LLN does nothing to "guarantee stable long-term results", for, as stated above, those stable long-term results have to be assumed in the first place for the terms occurring in the derivation to have the intuitive meaning we typically ascribe to them, not to mention that something has to be done to interpret $P()$ at all in the first place. Another instance from Wikipedia:

According to the law of large numbers, if a large number of six-sided die are rolled, the average of their values (sometimes called the sample mean) is likely to be close to 3.5, with the precision increasing as more dice are rolled.

Does this really follow from the mathematical theorem? In my opinion, the interpretation of the theorem that is used here rests on assuming this fact. There is a particularly vivid example in the "Treatise on Probability" by Keynes of what happens when one follows the WLLN with even a slight deviation from the initial assumption that the $p$'s are the relative frequencies in the limit of an infinite number of trials:

The following example from Czuber will be
sufficient for the purpose of illustration. Czuber’s argument is as
follows: In the period 1866–1877 there were registered in Austria

m = 4,311,076 male births

n = 4,052,193 female births

s = 8,363,269

for the succeeding period, 1877–1899, we are given only

m' = 6,533,961 male births;

what conclusion can we draw as to the number n of female births? We
can conclude, according to Czuber, that the most probable value

n' = nm'/m = 6,141,587

and that there is a probability P = .9999779 that n will lie between
the limits 6,118,361 and 6,164,813. It seems in plain opposition to
good sense that on such evidence we should be able with practical
certainty P = .9999779 = 1 − 1/45250 to estimate the number of female
births within such narrow limits. And we see that the conditions laid
down in § 11 have been flagrantly neglected. The number of cases, over
which the prediction based on Bernoulli’s Theorem is to extend,
actually exceeds the number of cases upon which the à priori
probability has been based. It may be added that for the period,
1877–1894, the actual value of n did lie between the estimated limits,
but that for the period, 1895–1905, it lay outside limits to which the
same method had attributed practical certainty.
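For the record, Czuber's "most probable value" in the quoted passage is just the proportional estimate $n' = n m'/m$; a trivial sketch of that arithmetic, with the figures as quoted by Keynes:

```python
# Figures as quoted by Keynes (from Czuber), Austria 1866-1877 and the succeeding period
m = 4_311_076        # male births, first period
n = 4_052_193        # female births, first period
m_prime = 6_533_961  # male births, second period

n_prime = n * m_prime / m   # Czuber's "most probable value" for female births
print(round(n_prime))       # Keynes quotes 6,141,587; small differences are presumably rounding
```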

Am I mistaken in my reasoning above, or are all of those really mistakes in Wikipedia? I have seen similar statements all over the place in textbooks, and I am honestly wondering what I am missing.

Best Answer

I. I agree with you that no version of the Law of Large Numbers tells us anything about real life frequencies, for the simple reason that no purely mathematical statement tells us anything about real life at all without first giving the mathematical objects in it a "real life interpretation" (which can never be stated, let alone "proven", within mathematics itself).

Rather, I think of the LLN as something which, within any useful mathematical model of probabilities and statistical experiments, should hold true! In the sense that: if you show me a new set of axioms for probability theory which you claim have some use as a model for real life dice rolling etc., and those axioms do not imply some version of the Law of Large Numbers -- then I would dismiss your axiom system, and I think so should you.


II. Most people would agree there is a real life experiment which we can call "tossing a fair coin" (or "rolling a fair die", "spinning a fair roulette wheel" ...), where we have a clearly defined finite set of outcomes, none of the outcomes is more likely than any other, we can repeat the experiment as many times as we want, and the outcome of the next experiment has nothing to do with any outcome we have seen so far.

And we could be interested in questions like: Should I play this game where I win/lose this much money in case ... happens? Is it more likely that after a hundred rolls the total of the dice is between 370 and 380, or between 345 and 350? Etc.

To gather quantitative insight into answering these questions, we need to model the real life experiment with a mathematical theory. One can debate (but again, such a debate happens outside of mathematics) what such a model could tell us, whether it could tell us something with certainty, whatever that might mean; but most people would agree that it seems we can get some insight here by doing some kind of math.

Indeed, we are looking for two things which only together have any chance to be of use for real life: namely, a "purely" mathematical theory, together with a real life interpretation (like a translation table) thereof, which allows us to perform the routine we (should) always do:

Step 1: Translate our real life question into a question in the mathematical model.

Step 2: Use our math skills to answer the question within the model.

Step 3: Translate that answer back into the real life interpretation.

The axioms of probability, as for example Kolmogorov's, do that: They provide us with a mathematical model which will give out very concrete answers. As with every mathematical model, those concrete answers -- say, $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$ -- are absolutely true within the mathematical theory (foundational issues à la Gödel aside for now). They also come with a standard interpretation (or maybe, a standard set of interpretations, one for each philosophical school). None of these interpretations are justifiable by mathematics itself; and what any result of the theory (like $P(\bar X_{100} \in [3.45,3.5]) > P(\bar X_{100} \in [3.7,3.8])$) tells us about our real life experiment is not a mathematical question. It is philosophical, and very much up to debate. Maybe a frequentist would say, this means that if you roll 100 dice again and again (i.e. performing a kind of meta-experiment, where each individual experiment is already 100 "atomic experiments" averaged), then the relative frequency of ... is greater than the relative frequency of ... . Maybe a Bayesian would say, well it means that if you have some money to spare, and somebody gives you the choice to bet on this or that outcome, you should bet on this, and not that. Etc.
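As an aside, the concrete comparison above is itself a model-internal statement that one can check numerically; here is a rough Monte Carlo sketch (sample size and seed are arbitrary choices for illustration) which estimates the two probabilities for the total of 100 fair dice -- dividing the totals by 100 gives the corresponding statement about $\bar X_{100}$:

```python
import random

random.seed(1)

def total_of_100_dice() -> int:
    """One sample from the model: the sum of 100 independent fair six-sided dice."""
    return sum(random.randint(1, 6) for _ in range(100))

trials = 200_000
totals = [total_of_100_dice() for _ in range(trials)]

p_345_350 = sum(345 <= t <= 350 for t in totals) / trials
p_370_380 = sum(370 <= t <= 380 for t in totals) / trials

print(f"estimated P(345 <= total <= 350) ~ {p_345_350:.3f}")
print(f"estimated P(370 <= total <= 380) ~ {p_370_380:.3f}")
```

Of course, in the spirit of the discussion above, this computation lives entirely inside the model; what it tells us about a hundred real dice is exactly the interpretational question at issue.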


III. Now consider the following statement, which I claim would be accepted by almost everyone:

( $\ast$ ) "If you repeat a real life experiment of the above kind many times, then the sample means should converge to (become a better and better approximation of) the ideal mean".

A frequentist might smirkingly accept ($\ast$), but quip that it is true by definition, because he might claim that any definition of such an "ideal mean" beyond "what the sample means converge to" is meaningless. A Bayesian might explain the "ideal mean" as, well you know, the average -- like if you put it in a histogram, see, here is the centre of weight -- the outcome you would bet on -- you know! And she might be content with that. And she would say, yes, of course that is related to relative frequencies exactly in the sense of ($\ast$).

I want to stress that ($\ast$) is not a mathematical statement. It is a statement about real life experiments, which we claim to be true, although we might not agree on why we do so: depending on your philosophical background, you can see it as a tautology or not, but even if you do, it is not a mathematical tautology (it is not a mathematical statement at all), just maybe a philosophical one.

And now let's say we do want a model-plus-translation-table for our experiments from paragraph II. Such a model should contain an object which models [i.e. whose "real life translation" is] one "atomic" experiment: that is the random variable $X$, or to be precise, an infinite collection of i.i.d. random variables $X_1, X_2, ...$.

It contains something which models "the actual sample mean after $100,1000, ..., n$ trials": that is $\bar X_n := \frac{1}{n}\sum_1^n X_i$.

And it contains something which models "an ideal mean": that is $\mu=EX$.

So with that model-plus-translation, we can now formulate, within such a model, a statement (or set of related statements) which, under the standard translation, appears to say something akin to ($\ast$).

And that is the (or are the various forms of the) Law of Large Numbers. And they are true within the model, and they can be derived from the axioms of that model.

So I would say: The fact that they hold true e.g. in Kolmogorov's Axioms means that these axioms pass one of the most basic tests they should pass: We have a philosophical statement about the real world, ($\ast$), which we believe to be true, and of the various ways we can translate it into the mathematical model, those translations are true in the model. The LLN is not a surprising statement on a meta-mathematical level for the following reason: Any kind of model for probability which, when used as a model for the above real life experiment, would not give out a result which is the mathematical analogue of statement ($\ast$), should be thrown out!

In other words: Of course good probability axioms give out the Law of Large Numbers. They are made so that they give them out. If somebody proposed a set of mathematical axioms, and a real-life-translation-guideline for the objects in there, and any model-internal version of ($\ast$) would be wrong -- then that model should be deemed useless (both by frequentists and Bayesians, just for different reasons) to model the above real life experiments.


IV. I want to finish by pointing out one instance where your argument seems contradictory, which, when exposed, might make what I write above more plausible to you.

Let me simplify an argument of yours like this:

(A) A mathematical statement like the LLN in itself can never make any statement about real life frequencies.

(B) Many sources claim that LLN does make statements about real life frequencies. So they must be implicitly assuming more.

(C) As an example, you exhibit a Kolmogorov quote about applying probability theory to the real world, and say that it "seems equivalent to introducing the weak law of large numbers in a particular, slightly different form, as an additional axiom."

I agree with (A) and (B). But (C) is where I want you to pause and think: Were we not in agreement, cf. (A), that no mathematical statement can ever tell us anything about real life frequencies? Then what kind of "additional axiom" could say that? Whatever the otherwise mistaken sources in (B) are implicitly assuming, and Kolmogorov himself talks about in (C), it cannot just be an "additional axiom", at least not a mathematical one: one can throw in as many mathematical axioms as one wants; they will never bridge the fundamental gap in (A).

I claim the thing that all the sources in (B) are implicitly assuming, and what Kolmogorov talks about in (C), is not an additional axiom within the mathematical theory. It is the meta-mathematical translation / interpretation that I talk about above, which in itself is not mathematical, and in particular cannot be introduced as an additional axiom within the theory.

I claim, indeed, most sources are very careless, in that they totally forget the translation / interpretation part between real life and mathematical model, i.e. the bridge we need to cross the gap in (A), i.e. steps 1 and 3 in the routine explained in paragraph II. Of course it is taught in any beginner's class that any model in itself (i.e. without a translation, without steps 1 and 3) is useless, but this is commonly forgotten already in the non-statistical sciences, and more so in statistics, which leads to all kinds of confusion. We spend so much time and effort on step 2 that we often forget steps 1 and 3; also, step 2 can be taught and learned and put on exams, but steps 1 and 3 not so well: they go beyond mathematics and seem to fit better into a science or philosophy class (although I doubt they get a good enough treatment there either). However, if we forget them, we are left with a bunch of axioms linking together almost meaningless symbols; and the remnants of meaning which we, as humans, cannot help applying to these symbols quickly seem to be nothing but circular arguments.