Why is long-term relative frequency used to predict the probability of an event occurring in a single trial

intuition, probability, probability theory

I am a beginner statistics student learning probability from a frequentist perspective. I am confused about how a 'probability' applies to the real world.

Probability is the relative frequency of an event over a very large (theoretically infinite) number of trials. My question: how is this useful when trying to predict the outcome of a random process in a single trial?

For example, a die landing on one has a probability of $\frac{1}{6}$. That's fine if we want to predict the chance of landing a one over a large number of rolls, but it's not accurate for predicting the outcome of a single roll.

So my question is: what is the rationale for using the long-term relative frequency as the chance of an event in a single repetition of the experiment? Is it just the 'best guess' we have?

I know logically it may just be the 'best guess' we have, but intuitively it's not clicking for me. An intuitive explanation with as little mathematical language as possible would help!

Edit: To clarify, this is my confusion. The relative frequency of landing on a one in (for example) 6 rolls of the die is NOT equal to 1/6, nor does it have to come near 1/6. It only comes close to 1/6 after a large number of trials.

So my question is: in a single trial/roll of the die, why is 1/6 the best prediction we have for the probability of landing a one? Is it because the long-term relative frequency is about the best guess we have for the outcome of any single trial? Is there more to this than it just being the best guess?

Best Answer

Let $X\sim \textrm{Bernoulli}(p)$, with $p$ unknown. If we have IID samples $X_1,X_2,\ldots,X_n$ and $N_1$ is the total count of successes, then the 'most likely' (maximum-likelihood) $p$ is the argument maximizing the joint probability of the outcomes (they are IID, so you just multiply the pmfs): $$p_n^*=\operatorname*{arg\,max}_{p \in [0,1]}\prod_{k=1}^n\bigl(p\,\mathbb{I}_{\{1\}}(X_k)+(1-p)\,\mathbb{I}_{\{0\}}(X_k)\bigr)=\operatorname*{arg\,max}_{p\in [0,1]}p^{N_1}(1-p)^{n-N_1}.$$ If $0<N_1<n$, we set the derivative to $0$: $$\frac{d}{dp}\,p^{N_1}(1-p)^{n-N_1}=\frac{(1-p)^{n-N_1}p^{N_1-1}(np-N_1)}{p-1}=0 \implies p^*_n=\frac{N_1}{n};$$ otherwise $p^*_n=0$ if $N_1=0$ and $p^*_n=1$ if $N_1=n$. This is exactly the relative frequency of successes in our sample.

This estimator has desirable properties. It is unbiased, $$E[p^*_n]=\frac{np}{n}=p,$$ and by Chebyshev's inequality it is consistent: $$P(|p^*_n-p|>\varepsilon)\leq\frac{1}{\varepsilon^2}E[|p^*_n-p|^2]=\frac{1}{\varepsilon^2}\textrm{Var}[p^*_n]=\frac{p(1-p)}{\varepsilon^2 n}\stackrel{n \to \infty}{\longrightarrow} 0.$$ That is why we use $p^*_n$ to estimate $p$ in frequentist statistics. In other words, $p^*_n$ is not the probability $p$ per se; rather, for any $\varepsilon>0$, the probability that $|p^*_n-p|$ exceeds $\varepsilon$ goes to $0$ in the limit $n \to \infty$.
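To see this convergence concretely, here is a minimal Python sketch (an illustration, not part of the derivation; the seed and checkpoint values are arbitrary choices). It draws IID Bernoulli($1/6$) trials, mimicking the die example, and prints the relative frequency $p^*_n = N_1/n$ at increasing $n$:

```python
import random

random.seed(0)        # arbitrary seed, just for reproducibility
p = 1 / 6             # true success probability (die lands on one)
checkpoints = {10, 100, 1_000, 10_000, 100_000}

successes = 0         # N_1, the running count of successes
for n in range(1, 100_001):
    successes += random.random() < p   # one Bernoulli(p) trial
    if n in checkpoints:
        print(f"n = {n:>7,}:  p*_n = {successes / n:.4f}   (true p = {p:.4f})")
```

Typically the early checkpoints land well away from $1/6$ (exactly the asker's point about 6 rolls), while the later ones settle near $0.1667$, in line with the $p(1-p)/(\varepsilon^2 n)$ bound above shrinking as $n$ grows.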
