[Math] Probability Density Function Interpretation

normal distribution, probability, statistics

I am BEGINNING to study Statistics and Probability and am trying to understand what a probability density function is and what it is used for.

My current interpretation is:

The name "function" indicates to me something that provides an output dependent on the input I give it. Take for example the PDF for the standard normal distribution (shown below):

$$
p(x) = \mathcal{N}(x;0,1) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}
$$

In my mind the above equation describes the probability/likelihood that a continuous random variable $x$ takes on a value in its sample space (i.e. the set of all possible values).

So let's say this PDF (normal distribution) describes the time taken for men to run a marathon (the real average is about 4 hours).
If I plotted this PDF, the $y$-axis would show non-negligible values for marathon times from around 2 hours (on the extreme left) to 6 hours (on the extreme right), with the average/mean centered at 4 hours.

If I programmed the PDF equation (above) into a computer and then ran a script that requested an input $x$, I could provide any real-valued input in the domain from $-\infty$ to $+\infty$, and the output of the PDF equation would give me the probability that a man would finish the race in that time?

Why is this useful? If I'm standing at the start line before the race begins and a competitor walks over to me and bets me $20 that he can finish the race in exactly 3 hours, and I know nothing else about him (his training regime, etc.), I can quickly take out my phone, run the script and enter the value 3 hours, and the output can be interpreted as the probability that the man will finish the race in exactly 3 hours? If I fancy the odds I might decide it is a good idea to accept his bet.

Questions related to my current understanding are as follows:

(1) Is the above interpretation correct, in whole or in part?

(1.1) If partially then where exactly am I getting my wires crossed?

(2) Bonus Question: How would you link an understanding of standard deviation and/or variance into this example?

Best Answer

A couple of things that you may find useful:

Continuous vs. Discrete

The distribution you use as an example (the normal distribution) is a continuous distribution, in the sense that the set of values the random variable can take is uncountable. Other examples of such distributions are the $\beta$-distribution, the logit distribution, $\dots$ Here's a comprehensive list of continuous distributions. The deal with these distributions is that the probability that the variable takes one particular value is exactly zero. The value of the PDF at a point is therefore not a probability but a probability per unit of the variable (a density). What does have a meaning is the probability of getting a value in some measurable set. In your example, this would mean telling the script to calculate the probability of finishing in a time $t$ between two given numbers $t_1 < t_2$, which is the integral of the PDF $f$ over that interval:

$$ P(t_1 < t < t_2) = \int_{t_1}^{t_2}{\rm d}t~f(t) $$
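As a rough sketch of what such a script could look like, the Python below evaluates this integral for a normal distribution using the closed form of its cumulative distribution function. The mean of 4 hours comes from the question; the standard deviation of 0.75 hours and the function names are assumptions made up for illustration, not real marathon statistics.

```python
import math

MEAN_HOURS = 4.0   # average finishing time quoted in the question
STD_HOURS = 0.75   # hypothetical spread, chosen only for illustration


def normal_pdf(t, mean=MEAN_HOURS, std=STD_HOURS):
    """Density f(t): a probability per hour, not a probability."""
    z = (t - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(t, mean=MEAN_HOURS, std=STD_HOURS):
    """P(T <= t), written with the error function."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))


def prob_between(t1, t2, mean=MEAN_HOURS, std=STD_HOURS):
    """P(t1 < T < t2): the integral of the density from t1 to t2."""
    return normal_cdf(t2, mean, std) - normal_cdf(t1, mean, std)


if __name__ == "__main__":
    print("density at 3 h:          ", normal_pdf(3.0))
    print("P(2.9 h < t < 3.1 h):    ", prob_between(2.9, 3.1))
    print("P(exactly 3 h):          ", prob_between(3.0, 3.0))
    # A density can exceed 1 when the distribution is narrow enough,
    # which is another sign that it is not itself a probability.
    print("density at mean, std=0.1:", normal_pdf(4.0, 4.0, 0.1))
```

For a short interval the probability is approximately the density times the interval width, which is the practical link between the number the PDF returns and an actual probability.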

This is in contrast with discrete distributions, where the set of possible values the variable can take is countable. Typical examples are the result of rolling a die or flipping a coin. Here you can find other examples.

Why is this useful?

The list of cases where knowing the probability distribution of a random variable is useful is rather long, and each field has its own applications. I can give you a couple of examples that some people may consider useful.

Imagine you want to invest in the stock market. Stock prices fluctuate, and you're not sure whether a stock will lose value (losing you money) or go up. If you knew the probability distribution of the stock's price at a given time, you could ask and answer the question "what is the probability of losing a fraction $x$ of my investment?"

There are many other very interesting applications. Here's another one: quantum mechanics is in essence a theory that describes the statistical nature of subatomic entities. There, knowing the probability distribution associated with a given physical system amounts to knowing how the system behaves.

Meaning of variance

In your example of the racers, imagine two situations. In both of them, competitors cross the finish line after 4 h on average:

  1. 99% of the racers cross the line between 3:50 h and 4:10 h

  2. 99% of the racers cross the line between 2 h and 6 h

This tells you something about these two distributions: clearly they are different. In the second case, you need a longer interval to account for the same fraction of racers, so in a sense the distribution is broader, that is, it has a larger variance than the first one.
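To put numbers on the two situations, here is a small Python sketch that assumes both finishing-time distributions are normal (the scenarios above do not actually say so). Under that assumption, a central 99% interval extends about 2.58 standard deviations on each side of the mean, so the standard deviation can be read off from the interval width. The helper name `sigma_from_central_interval` is made up for this example.

```python
from statistics import NormalDist


def sigma_from_central_interval(low, high, coverage=0.99):
    """Standard deviation of a normal distribution whose central
    `coverage` fraction lies between `low` and `high`."""
    # z such that P(-z < Z < z) = coverage for a standard normal Z
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return (high - low) / (2.0 * z)


# Situation 1: 99% of racers finish between 3:50 h and 4:10 h
sigma_narrow = sigma_from_central_interval(4.0 - 10.0 / 60.0, 4.0 + 10.0 / 60.0)

# Situation 2: 99% of racers finish between 2 h and 6 h
sigma_wide = sigma_from_central_interval(2.0, 6.0)

if __name__ == "__main__":
    print("std, situation 1 (hours):", sigma_narrow)
    print("std, situation 2 (hours):", sigma_wide)
    print("variance ratio (2 vs 1): ", (sigma_wide / sigma_narrow) ** 2)
```

The second interval is twelve times wider than the first, so under the normal assumption its standard deviation is twelve times larger and its variance 144 times larger.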
