[Math] Understanding (t) dt in a definition of pdf

statistics

I am learning stats from a book called All of Statistics and from Khan Academy, which is not as mathematically deep. I have an intuitive understanding of a pdf and cdf from Khan Academy.

All of Statistics defines a cdf as

$F_X(x) = P(X \leq x)$

A few pages later it introduces a function $f_X$, called the probability density function (pdf), which satisfies:

$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$

What is the meaning of $(t)\,dt$ in that definition? Ignoring the $(t)\,dt$ part, it would seem that the definition says that the cdf at $x$ is the area under the pdf from negative infinity up to $x$. That makes sense to me. So what is the meaning of $(t)\,dt$ in the definition? All of Statistics is a really careful book, so I am sure it is there for a reason.

Best Answer

You are correct to be thinking in terms of areas. The comments about $(t)$ and $dt$ are mathematically correct, but perhaps not useful at your mathematical level. (This is going to be a long answer; stop when you have your answer or when it starts to get too mathematical.)

The total area under the density function of a random variable $X$ is $1.$ The probability that $X$ lies in a particular interval $(a, b]$ is written as $P(a < X \le b)$. It is the area beneath the density curve $f_X$ and above the interval $(a,b].$ The notation $\int_a^b f_X(t)\,dt$ is the way mathematicians write that area.

The probability that the random variable $X$ is less than or equal to the number $x$ is written: $$P(X \le x) = P(-\infty < X \le x) = \int_{-\infty}^x f_X(t)\,dt.$$

[The 'variable of integration' $t$ is part of the process of numerical evaluation of the probability, not a part of the answer. The integral could just as well be written $\int_{-\infty}^x f_X(\xi)\, d\xi$ or $\int_{-\infty}^x f_X(Q)\, dQ,$ or with any other symbol. Hence the term "dummy variable."]
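If it helps to see the "dummy variable" idea in code, here is a minimal sketch. It assumes, purely for illustration, the density $2x$ on $[0,1]$ that is used as the example later in this answer, and it approximates the integral with a midpoint Riemann sum. The helper names `integrate`, `density`, `cdf_with_t`, and `cdf_with_q` are made up for this sketch.

```python
def integrate(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with a midpoint Riemann sum."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def density(value):
    # Illustrative density (an assumption for this sketch): 2x on [0, 1], 0 elsewhere.
    return 2.0 * value if 0.0 <= value <= 1.0 else 0.0


def cdf_with_t(x):
    # The integration variable is called t here ...
    # The density is 0 below 0, so starting the integral at 0 loses nothing.
    return integrate(lambda t: density(t), 0.0, x)


def cdf_with_q(x):
    # ... and q here. Only the upper limit x reaches the caller.
    return integrate(lambda q: density(q), 0.0, x)


# Both versions give the same number: the name of the inner variable is irrelevant.
print(cdf_with_t(0.5) == cdf_with_q(0.5))
```

The inner name (`t` or `q`) lives only inside the integration, just like $t$ in $\int_{-\infty}^x f_X(t)\,dt$; the result is a function of $x$ alone.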

Once you have the CDF, you can use it to find probabilities of various intervals. For example, $$P(0 < X \le 1/2) = F_X(1/2) - F_X(0) = \int_0^{1/2} f_X(t)\,dt.$$

Here is a specific example: Suppose the density function of $X$ is $f_X(x) = 2x,$ for $x$ between 0 and 1, and $f_X(x) = 0$ for other values of $x.$ You can draw a sketch of it: over the interval from 0 to 1 it looks like a right triangle with vertices at $(0,0), (1,2),$ and $(1,0).$ You can check that it encloses total area $(1/2)(1)(2) = 1$, as a density function must.

If you want to find $P(0 < X \le 1/2)$ for this simple case, you can see that it is equal to 1/4. The small triangle under $f_X(x)$ and above $(0, 1/2)$ has base $1/2$ and height $f_X(1/2) = 1$, so its area is half its base times its height: $(1/2)(1/2)(1) = 1/4.$
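The same areas can be approximated numerically, which is roughly what the integral notation is describing: add up many thin rectangles of height $f_X(t)$ and width $dt$. The following is only a sketch using a midpoint Riemann sum; the names `f_X` and `area_under` are chosen here for illustration.

```python
def f_X(x):
    # Density from the example: 2x on [0, 1], 0 elsewhere.
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def area_under(f, a, b, n=10_000):
    """Sum n thin rectangles: height f(midpoint), width (b - a) / n."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) * width for i in range(n))


total = area_under(f_X, 0.0, 1.0)  # total area under the density
prob = area_under(f_X, 0.0, 0.5)   # area above the interval (0, 1/2]

print(total, prob)  # values very close to 1 and 0.25
```

Each term `f(...) * width` plays the role of $f_X(t)\,dt$, and the sum plays the role of the integral sign.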

If you know some calculus, you can find that the CDF of this random variable $X$ is $F_X(x) = x^2,$ for $0 < x \le 1.$ Then $$P(0 < X \le 1/2) = F_X(1/2) - F_X(0) = (1/2)^2 - 0 = 1/4.$$
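For readers who want the calculus step spelled out, here is a short derivation using the density from the example above ($f_X(t) = 2t$ on $[0,1]$, zero elsewhere). For $0 < x \le 1$, the density contributes nothing below $0$, so $$F_X(x) = \int_{-\infty}^x f_X(t)\,dt = \int_0^x 2t\,dt = \Big[\,t^2\,\Big]_0^x = x^2.$$ For $x \le 0$ the integral is $0$, and for $x > 1$ it is $1$, because all of the area has already been accumulated by $x = 1$.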

Notes: (1) In this simple example, calculus isn't necessary because you can find areas under $f_X$ using elementary geometry. (2) Using $<$ on one side of inequalities and $\le$ on the other is just a 'convention' (habit). The $\le$ could just as well be $<$ because, for a continuous random variable, there is zero probability at any individual point. (3) In many cases (such as the simple example above) you never have to deal with $-\infty$ because all of the probability is in some finite interval. (4) In some examples, the density has no antiderivative that can be written in terms of elementary functions, so $F$ has no simple formula and other methods need to be used to find areas under the density curve. The famous normal distribution ("bell-shaped curve") is an example of this. Instead of finding a formula for $F$, you use tables of probabilities, a calculator, or statistical software.
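As an illustration of the "statistical software" route in note (4), here is a small sketch using `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later). The interval $(-1, 1]$ and the helper name `interval_probability` are choices made for this example only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)


def interval_probability(dist, a, b):
    """P(a < X <= b) computed as F(b) - F(a), where F is the CDF."""
    return dist.cdf(b) - dist.cdf(a)


p = interval_probability(Z, -1.0, 1.0)
print(round(p, 4))  # about 0.6827
```

The software evaluates $F$ numerically, but the logic is the same as before: the probability of an interval is a difference of two CDF values.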
