Probability – Meaning of Pr(dx, dy) in Integration and Statistics

integration, machine-learning, measure-theory, probability, statistics

The book The Elements of Statistical Learning by Hastie et al. (page 18) defines the expected prediction error as
\begin{align}
\operatorname{EPE}(f) &= \operatorname E(Y - f(X))^2\\
& = \int [y - f(x)]^2 \Pr(dx, dy)
\end{align}

Why is it written this way?
Why not as below, to be consistent with the usual definition of an expected value?
$$ \operatorname{EPE}(f) = \operatorname E(Y - f(X))^2 = \iint [y - f(x)]^2 \Pr(x,y) \,dx \,dy$$
What does $\Pr(dx, dy)$ even mean?

Best Answer

We are dealing with a probability density function. Since this is a bivariate density, call it $p(x,y)$ to avoid clashing with the prediction function $f$, its output is a density per unit area. That is why a value $p(x,y)$ is not a valid probability on its own, and why there is no restriction that $p(x,y) \leq 1$.
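
To make "density per unit area" concrete, here is a minimal numerical sketch (my own illustration, not from the book) using an assumed tight bivariate normal: its density exceeds 1 at the origin, yet a Riemann sum of $p(x,y)\,\Delta x\,\Delta y$ over a grid still adds up to 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed example distribution: a bivariate normal with small variances.
# Its density can exceed 1 at a point, but the values are probability
# *per unit area*, not probabilities, so this is fine.
dist = multivariate_normal(mean=[0.0, 0.0], cov=[[0.01, 0.0], [0.0, 0.01]])
print(dist.pdf([0.0, 0.0]))    # about 15.9 -- well above 1

# A Riemann sum of p(x, y) * dx * dy over a fine grid still comes out ~1.
xs = np.linspace(-1.0, 1.0, 401)
ys = np.linspace(-1.0, 1.0, 401)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
X, Y = np.meshgrid(xs, ys)
print(dist.pdf(np.dstack([X, Y])).sum() * dx * dy)   # about 1.0
```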

So we can argue that, for very small $\Delta x$ and $\Delta y$, $\Pr(\Delta x, \Delta y) \approx p(x,y) \, \Delta x \, \Delta y$, and in the limit as $\Delta x$ and $\Delta y$ go to $0$ we have $\Pr(dx, dy) = p(x,y)\,dx\,dy$.
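
Along the same lines, one can check numerically that replacing $\Pr(dx, dy)$ by $p(x,y)\,\Delta x\,\Delta y$ on a fine grid reproduces the EPE. The sketch below (again my own illustration, not from the book) assumes $(X, Y)$ is bivariate normal with correlation $0.8$ and uses an arbitrary prediction rule $f(x) = 0.8x$; both the grid sum and a Monte Carlo average land near the exact value $0.36$.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed joint distribution and prediction rule, purely for illustration.
rho = 0.8
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
f = lambda x: 0.8 * x

# Riemann-sum version of EPE(f) = double integral of [y - f(x)]^2 p(x, y) dx dy,
# i.e. Pr(dx, dy) approximated by p(x, y) * dx * dy on a fine grid.
xs = np.linspace(-6.0, 6.0, 601)
ys = np.linspace(-6.0, 6.0, 601)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
X, Y = np.meshgrid(xs, ys)
epe_grid = ((Y - f(X)) ** 2 * joint.pdf(np.dstack([X, Y]))).sum() * dx * dy

# Monte Carlo version: average the squared error over draws from the joint.
samples = joint.rvs(size=200_000, random_state=0)
epe_mc = np.mean((samples[:, 1] - f(samples[:, 0])) ** 2)

print(epe_grid, epe_mc)   # both close to 0.36 = Var(Y - 0.8 X)
```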