Applying Leibniz’s integral rule to the Gaussian distribution’s normalization condition

calculus, distributions, integral, probability

I'm working on problem 1.8 of Bishop's Pattern Recognition and Machine Learning and am having a hard time understanding one of the technical details in a solution that I found online. Specifically, the problem asks one to show that the expectation of $x^{2}$ under the Gaussian distribution is equal to $\mu^{2} + \sigma^{2}$, and it comes with the hint to differentiate both sides of the Gaussian normalization condition. I've attached an excerpt of the solution below.

[image: excerpt of the online solution]
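As I understand it (this is my own reconstruction of the standard hint-based argument, not a verbatim copy of the excerpt), the manipulation amounts to differentiating both sides of the normalization condition $\int \mathcal{N}(x\mid\mu,\sigma^2)\,dx = 1$ with respect to $\sigma^2$ and moving the derivative inside the integral:

$$\int \frac{\partial}{\partial\sigma^2}\mathcal{N}(x\mid\mu,\sigma^2)\,dx = \int \mathcal{N}(x\mid\mu,\sigma^2)\left[\frac{(x-\mu)^2}{2\sigma^4}-\frac{1}{2\sigma^2}\right]dx = 0,$$

which rearranges to $E[(x-\mu)^2]=\sigma^2$ and hence $E[x^2]=\mu^2+\sigma^2$.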

To me, it appears that the differentiation of the integral with respect to $\sigma^{2}$ shown there implicitly uses the Leibniz integral rule, or something akin to it. However, the statement of the Leibniz integral rule as I've seen it covers definite integrals with finite (possibly variable) limits, not improper integrals over all of $\mathbb{R}$. In this sense, it appears to me that the differentiation carried out in the solution is technically "illegal", despite producing the correct value.

Can someone please help me understand what's going on here? Does this solution use the Leibniz integral rule, or am I misreading it? If the Leibniz integral rule isn't being used, how instead can the integral be differentiated with respect to $\sigma^{2}$?

Best Answer

Nice question!

option 1

Derivatives are limits (of fractions), and moving a limit inside an expectation typically requires invoking a theorem such as the monotone convergence theorem, Lebesgue's dominated convergence theorem, the uniformly integrable convergence theorem, or maybe even Fatou's lemma. The theorem @jbowman seems to be referring to is the following one from Jeffrey Rosenthal's book, which provides a condition that allows you to exchange the order of differentiation and integration. In the book it's numbered Proposition 9.2.1.


Let $\{F_t\}_{a<t<b}$ be a collection of random variables with finite expectations on a probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose that for each $\omega$ and each $t\in (a,b)$, the derivative $F'_t(\omega)=\frac{\partial}{\partial t} F_t(\omega)$ exists. Suppose further that there is a random variable $Y$ on the same probability triple with $E(Y)<\infty$ and $|F'_t| \leq Y$ for all $t\in (a,b)$. Then:

  • $F'_t$ is a random variable with finite expectation;
  • $\phi(t)$ is differentiable with finite derivative $\phi'(t)=E(F'_t)$ for all $t\in (a,b)$, where $\phi(t)=E(F_t)$.

Note: $E$ here refers to Lebesgue expectations (integrals with respect to $\mathbf{P}$), which differs from the $E$ in your problem.


Here's the proof. You'll notice that it does indeed use DCT, as @jbowman mentioned, as well as the mean value theorem.

Write

$$F'_t=\lim_{h\rightarrow 0} \frac{F_{t+h}-F_t}{h}$$

and notice $F'_t$ is a random variable (i.e. it's measurable) as it is the limit of random variables.

Furthermore, we have $E(|F'_t|)\leq E(Y)<\infty$.

By the mean value theorem, for every $\omega$ and $h$ there is a $t^*$ (depending on $\omega$ and $h$) between $t$ and $t+h$ such that $\frac{F_{t+h}-F_t}{h}=F'_{t^*}$.

Then $|\frac{F_{t+h}-F_t}{h}|\leq Y$. By the dominated convergence theorem:

\begin{align*} \phi'(t) & =\lim_{h\rightarrow 0} \frac{\phi(t+h)-\phi(t) }{h}\\ &=\lim_{h\rightarrow 0} E \big ( \frac{F_{t+h}-F_t}{h} \big ) \\ &=E \big (\lim_{h\rightarrow 0} \frac{F_{t+h}-F_t}{h} \big )\\ &=E(F'_t). \end{align*}

Most of the $\LaTeX$ above is taken from some slides from one of my courses.

To be honest, I'm not sure this is legitimate. The thing that tripped me up is that I've always seen this theorem stated for expectations with respect to finite measures (e.g. probability measures), not sigma-finite measures (e.g. integration with respect to $dx$); as stated, this option is about expectations of random variables, not about differentiating densities under an integral with respect to $dx$. I'm hoping it still works for that situation too, but I've never directly constructed Lebesgue integrals with respect to sigma-finite measures. The usual construction goes simple random variables, then nonnegative random variables, then general ones, and all the while you're dealing with a probability measure that puts mass $1$ on the whole space. I guess it works for that situation, too? I was also getting tripped up by the theorem being unclear about whether $t$ can be a parameter of the random variable itself (it can't be).
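If the proposition does extend to integration with respect to $dx$ (which I'm assuming here rather than proving), here is a sketch of how the dominating function could be chosen in the Gaussian case; restricting $\sigma^2$ to a compact interval $[a,b]\subset(0,\infty)$ is my own simplification. Writing $t=\sigma^2$ and $f(x;t)=(2\pi t)^{-1/2}\exp\{-(x-\mu)^2/(2t)\}$,

$$\frac{\partial}{\partial t} f(x;t) = f(x;t)\left[\frac{(x-\mu)^2}{2t^2}-\frac{1}{2t}\right],$$

so for $t\in[a,b]$ with $0<a<b$,

$$\left|\frac{\partial}{\partial t} f(x;t)\right| \le \frac{1}{\sqrt{2\pi a}}\left[\frac{(x-\mu)^2}{2a^2}+\frac{1}{2a}\right]\exp\left\{-\frac{(x-\mu)^2}{2b}\right\} =: Y(x).$$

This $Y$ is integrable (a polynomial times a Gaussian kernel), so it can play the role of the dominating random variable in the proposition.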

option 2

A second option would be to fight the hint. It is straightforward to integrate directly to find $E[(X-\mu)^2]$, or to use Stein's lemma, whose proof just uses integration by parts. The justification does not involve moving a limit inside an integral; Stein's lemma, as usually written, does have a derivative inside an integral, but that derivative is with respect to $x$, not with respect to $\sigma^2$.
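For concreteness, here is the direct calculation via the substitution $u = x-\mu$ and integration by parts (this is just the standard computation, nothing specific to the excerpt):

\begin{align*} E[(X-\mu)^2] &= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} u^2 e^{-u^2/(2\sigma^2)}\,du \\ &= \frac{1}{\sqrt{2\pi\sigma^2}}\left(\Big[-\sigma^2 u\, e^{-u^2/(2\sigma^2)}\Big]_{-\infty}^{\infty} + \sigma^2\int_{-\infty}^{\infty} e^{-u^2/(2\sigma^2)}\,du\right) \\ &= \frac{1}{\sqrt{2\pi\sigma^2}}\,\sigma^2\sqrt{2\pi\sigma^2} = \sigma^2, \end{align*}

so that $E[X^2] = E[(X-\mu)^2] + 2\mu E[X] - \mu^2 = \sigma^2 + \mu^2$.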

option 3: @whuber's approach

Option 3 is to use a Taylor approximation. Out of the three approaches it most directly answers your question, and it doesn't even assume normality. Fix $x \in \mathbb{R}$ and write the one-dimensional, second-order Taylor approximation in $\sigma^2$:

$$ f(x; \sigma^2+h) = f(x; \sigma^2) + \frac{d}{d\sigma^2}f(x; \sigma^2)\, h + \frac{h^2}{2} \frac{d^2}{d(\sigma^2)^2}f(x; \sigma^2) + R(x,\sigma^2), $$ where the remainder term (which also depends on $h$) is such that $R(x,\sigma^2)/h^2 \to 0$ as $h \to 0$. We're ignoring $\mu$ for the moment by supposing it's fixed, and technically we have a different approximation for each value of $x$.

Plugging it in \begin{align*} &\frac{d}{d\sigma^2}\int f(x;\sigma^2)dx \\ &= \lim_{h \to 0}\frac{\int f(x;\sigma^2+h)dx - \int f(x;\sigma^2)dx}{h} \\ &= \int \frac{d}{d\sigma^2}f(x; \sigma^2)dx + \underbrace{\lim_{h \to 0} \frac{h}{2} \int \frac{d^2}{d(\sigma^2)^2}f(x; \sigma^2)dx + \lim_{h \to 0} \int \frac{R(x, \sigma^2)}{h} dx}_{\text{hopefully zero!}} \\ \end{align*}

There are several ways to impose assumptions that make the last two terms vanish. @whuber suggests the following: suppose $$ \left|\frac{d^2}{d(\sigma^2)^2}f(x; \sigma^2)\right| \le G(x) $$ where $G(x)$ is free of $\sigma^2$ and $\int G(x)\, dx < \infty$. Then $$ \left|\frac{h}{2} \int \frac{d^2}{d(\sigma^2)^2}f(x; \sigma^2)\,dx\right| \le \frac{|h|}{2} \int \left|\frac{d^2}{d(\sigma^2)^2}f(x; \sigma^2)\right| dx \le \frac{|h|}{2} \int G(x)\, dx \to 0. $$

For the last integral, you might assume the existence of higher-order derivatives so you can write out the remainder term explicitly (e.g. in Lagrange form), which allows a similar domination argument.
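As a sanity check (my own addition, not part of @whuber's comment), the domination assumption does hold for the Gaussian if $\sigma^2$ is restricted to a compact interval $[a,b]\subset(0,\infty)$, as in the sketch under option 1. Writing $t=\sigma^2$,

$$\frac{d^2}{dt^2} f(x;t) = f(x;t)\left[\left(\frac{(x-\mu)^2}{2t^2}-\frac{1}{2t}\right)^2 - \frac{(x-\mu)^2}{t^3} + \frac{1}{2t^2}\right],$$

and for $t\in[a,b]$ the bracketed factor is bounded in absolute value by a polynomial in $(x-\mu)^2$ with coefficients depending only on $a$, while $f(x;t)\le (2\pi a)^{-1/2}\exp\{-(x-\mu)^2/(2b)\}$. The product of these two bounds is an integrable $G(x)$, so the second-derivative term vanishes as required.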
