[Math] The motivation for defining the weak derivative as it is

measure-theory, sobolev-spaces, weak-derivatives

I've been reading lately about reproducing kernel Hilbert spaces (RKHS) and Gaussian processes (GP), and during my studies I came across the concepts of the weak derivative and Sobolev spaces. I have tried to make some sense of the weak derivative by reading these two questions on the site:

What is the intuition behind a function being 'weakly differentiable'?

Intuition about weakly differentiable functions

So I have now learned that:

A function $u:\Omega\rightarrow\mathbb{R}$, $\Omega\subseteq\mathbb{R}^n$, is
weakly differentiable with $\alpha$th weak partial derivative
$v=D^{\alpha}u$ (where $\alpha\in\mathbb{N}^n$ is a multi-index), if for
all compactly supported smooth functions $\phi$ it holds that:

$$\int_{\Omega} u \,D^{\alpha}\phi\,dx =
(-1)^{|\alpha|}\int_{\Omega}v\phi\,dx.$$
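
To convince myself that this definition pins down something concrete, here is a small numerical sanity check (a minimal sketch, assuming `numpy` and `scipy` are available; the bump function and the constants `c`, `r` are just my own illustrative choices). It takes $u(x)=|x|$ on $\Omega=(-1,1)$, the candidate weak derivative $v(x)=\operatorname{sign}(x)$, and one smooth compactly supported $\phi$, and verifies the defining identity numerically:

```python
import numpy as np
from scipy.integrate import quad

# Smooth bump test function supported in (c - r, c + r) = (-0.2, 0.8),
# a subset of Omega = (-1, 1). It is deliberately off-center: a bump
# centered at 0 would give 0 = 0 by symmetry, a vacuous check.
c, r = 0.3, 0.5

def phi(x):
    t = (x - c) / r
    return np.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1 else 0.0

def dphi(x):
    t = (x - c) / r
    if abs(t) >= 1:
        return 0.0
    # chain rule: d/dx exp(-1/(1 - t^2)) = phi(x) * (-2t)/(1 - t^2)^2 * (1/r)
    return phi(x) * (-2.0 * t) / (1.0 - t * t) ** 2 / r

u = abs        # u(x) = |x|: not classically differentiable at 0
v = np.sign    # candidate weak derivative v(x) = sign(x)

# With alpha = 1 the defining identity reads:  ∫ u phi' dx = -∫ v phi dx.
# Integrating over the support of phi suffices, since phi vanishes elsewhere.
lhs = quad(lambda x: u(x) * dphi(x), c - r, c + r, points=[0.0])[0]
rhs = -quad(lambda x: v(x) * phi(x), c - r, c + r, points=[0.0])[0]
print(lhs, rhs)   # the two printed numbers agree
```

Of course a single $\phi$ proves nothing, since the definition quantifies over all test functions, but a check like this helped me trust the formula.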

Quoting an earlier answer (by user Christopher A. Wong) from the links I provided above:

The basic intuition is that a weakly differentiable function looks
differentiable except for on sets of zero measure. This allows
functions that are not normally considered differentiable at "corners"
to have a weak derivative that is defined everywhere on the original
function's domain. The reason why weak derivatives ignore sets of zero
measure is precisely because weak derivatives are defined by
integrals, and integrals cannot see behavior on sets of zero measure.

Now I'm almost satisfied with this answer, but one point remains unclear:

Why is the weak derivative defined as it is? That is, when the idea of the weak derivative was first considered, why was that particular equation above chosen as the definition? Or was the definition a somewhat arbitrary choice that simply suited our purposes?

For example, one of the reasons for using squared error in many problems of statistics and optimization is (as I've understood it) that squared error has nice analytical properties, making the math easier than, e.g., absolute error.

The reason I'm asking is that many techniques of mathematics sometimes seem to appear as if from a magician's hat. For a beginner like myself, it is difficult to picture the scenario and conditions that gave rise to a discovery or definition.

Best Answer

Laurent Schwartz had the idea. Distributions are linear functionals defined on test functions; they generalize functions in the following sense. If $u$ is an actual function, then the corresponding distribution is the linear functional $$ \phi \mapsto \int u \;\phi\;dx $$ A linear functional not of this form may still be considered a generalized function.
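
As a rough illustration of this viewpoint (a minimal sketch in Python, assuming `scipy`; the names `as_distribution` and `delta` are mine for this example, not any standard API), a distribution can be modeled as a plain callable that eats a test function and returns a number:

```python
from scipy.integrate import quad

def as_distribution(u, a=-1.0, b=1.0):
    """The linear functional phi -> ∫ u * phi dx induced by an ordinary function u on (a, b)."""
    return lambda phi: quad(lambda x: u(x) * phi(x), a, b)[0]

T_abs = as_distribution(abs)   # the distribution corresponding to u(x) = |x|

def delta(phi):
    # The Dirac delta phi -> phi(0): a perfectly good linear functional on
    # test functions, but not induced by any ordinary function u as above.
    return phi(0.0)
```

The Dirac delta is the classic example of a functional not of the integral form: it is linear in $\phi$, yet no ordinary function $u$ reproduces $\phi \mapsto \phi(0)$ via $\int u\,\phi\,dx$.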

Now if $u$ is a function and the derivative $v = D^\alpha u$ actually exists, then integration by parts shows $$ \int_{\Omega}v\;\phi\,dx = (-1)^{|\alpha|}\int_{\Omega} u \,D^{\alpha}\phi\;dx $$ [no boundary terms because $\phi$ vanishes outside a compact subset of $\Omega$]
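
To spell this out in the simplest case, take $n = 1$ and $\alpha = 1$, with $v = u'$: $$ \int_{\Omega} v\,\phi\,dx = \big[u\,\phi\big]_{\partial\Omega} - \int_{\Omega} u\,\phi'\,dx = -\int_{\Omega} u\,\phi'\,dx, $$ because $\phi$, and hence $u\,\phi$, vanishes near $\partial\Omega$. Iterating this once per derivative produces the factor $(-1)^{|\alpha|}$ in the general formula.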

Next, if $u$ is a function but $D^\alpha u$ does not exist in the classical sense, it is still true that the functional $$ \phi \mapsto (-1)^{|\alpha|}\int_{\Omega} u \,D^{\alpha}\phi\;dx $$ makes sense and defines a generalized function (a.k.a. Schwartz distribution). So things work out if we go ahead and call this functional $D^\alpha u$. [It is not a function, but a distribution.]
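
To see that this really can leave the world of ordinary functions, apply the definition once more to $u = \operatorname{sign}$ (the weak derivative of $|x|$): the resulting functional is $\phi \mapsto 2\phi(0)$, i.e. $D\operatorname{sign} = 2\delta_0$, a distribution and not a function. A quick numerical check of this (same caveats as before: `numpy`/`scipy` assumed, and the bump is an arbitrary choice of test function):

```python
import numpy as np
from scipy.integrate import quad

# Smooth bump supported in (-r, r), so that 0 lies inside the support.
r = 0.5

def phi(x):
    t = x / r
    return np.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1 else 0.0

def dphi(x):
    t = x / r
    if abs(t) >= 1:
        return 0.0
    return phi(x) * (-2.0 * t) / (1.0 - t * t) ** 2 / r

# Apply the definition with u = sign and alpha = 1:
#   <D sign, phi> := -∫ sign(x) phi'(x) dx
pairing = -quad(lambda x: np.sign(x) * dphi(x), -r, r, points=[0.0])[0]
print(pairing, 2 * phi(0.0))   # equal: D sign = 2 * delta_0, not a function
```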