Neural Networks – Proving Density for Function Approximation with a One-Hidden-Layer Perceptron

approximation-theory, functional-analysis, hilbert-spaces, measure-theory, neural-networks

I'm working on a problem related to function approximation within the $L^2\left(I_n\right)$ space of square-integrable functions:

Problem Statement:

Given a lemma without proof:

$\textit{Lemma}$: Let $g \in L^2\left(I_n\right)$ be such that $\int_{\mathcal{H}} g(x)\, dx = 0$ for every half-space $\mathcal{H} := \{x : w^T x + \theta > 0\} \cap I_n$. Then $g = 0$ almost everywhere.

Note that by choosing $\theta$ large enough (e.g. $\theta > \|w\|_1$, so that $w^T x + \theta > 0$ for every $x \in I_n$), the half-space becomes the entire hypercube. Hence any $g$ as in the lemma also satisfies $\int_{I_n} g(x)\, dx = 0$.

The current task is to show that any function $g \in L^2\left(I_n\right)$ can be approximated arbitrarily well in the $L^2$ norm by the output of a one-hidden-layer perceptron whose activation function $\sigma(x)$ is the Heaviside step function, defined as:
$$
\sigma(x)= \begin{cases}1, & x \geq 0 \\ 0, & x<0\end{cases}
$$
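For concreteness, here is a minimal numerical sketch of the object in question: the output of a one-hidden-layer perceptron with Heaviside activation. All names and parameter values here are my own illustrative choices, not part of the problem statement.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 1 for z >= 0, 0 for z < 0."""
    return (z >= 0).astype(float)

def perceptron(x, W, theta, alpha):
    """Output of a one-hidden-layer perceptron with N Heaviside units.

    x     : (m, n) array of m points in I_n = [0, 1]^n
    W     : (N, n) array of hidden-layer weight vectors w_i
    theta : (N,)   array of biases theta_i
    alpha : (N,)   array of output weights alpha_i
    Returns the (m,) array f(x) = sum_i alpha_i * sigma(w_i^T x + theta_i).
    """
    return heaviside(x @ W.T + theta) @ alpha

# Example: a single neuron in n = 2 dimensions cuts I_2 along a half-space.
x = np.random.rand(5, 2)               # five points in [0, 1]^2
W = np.array([[1.0, -1.0]])            # w = (1, -1)
theta = np.array([0.0])
alpha = np.array([1.0])
print(perceptron(x, W, theta, alpha))  # indicator of {x1 - x2 >= 0}
```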


Progress Made So Far:

I am examining the use of a one-hidden-layer perceptron with Heaviside activation for approximating functions in $L^2\left(I_n\right)$. The idea is that the hidden units carve $I_n$ into half-spaces, whose intersections yield hyperrectangles, and that linear combinations of such pieces should be able to represent any square-integrable function; a one-dimensional sketch of this intuition is given below.
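To illustrate (this is my own sketch, not part of the problem): in one dimension the indicator of an interval $[a, b)$ is a difference of two Heaviside units, $\mathbf 1_{[a,b)}(x) = \sigma(x - a) - \sigma(x - b)$, so step-function approximations of a target lie in the linear span of the family. Note that for $n \ge 2$ the indicator of a hyperrectangle is *not* a plain linear combination of half-space indicators, which is part of why a rigorous density proof needs more than this picture.

```python
import numpy as np

def heaviside(z):
    return (z >= 0).astype(float)

def step_approx(x, target, N):
    """Approximate `target` on [0, 1] by N equal-width steps.

    Each step 1_{[a, b)} is written as sigma(x - a) - sigma(x - b),
    i.e. as a linear combination of two Heaviside neurons.
    """
    edges = np.linspace(0.0, 1.0, N + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    values = target(mids)  # one output weight per step
    steps = heaviside(x[:, None] - edges[:-1]) - heaviside(x[:, None] - edges[1:])
    return steps @ values

target = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 10_000)
for N in (4, 16, 64):
    err = np.sqrt(np.mean((step_approx(x, target, N) - target(x)) ** 2))
    print(f"N = {N:3d} steps (2N Heaviside units), L2 error ~ {err:.4f}")
```

Running this shows the empirical $L^2$ error shrinking roughly like $1/N$, consistent with the density claim, but of course this is a heuristic, not a proof.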

However, I'm seeking advice on $\textit{formally}$ proving the density of these approximations in $L^2(I_n)$ and establishing a method for choosing perceptron parameters (weights and biases) that ensures any function $g \in L^2(I_n)$ can be approximated with arbitrary precision.

Any guidance on applying functional analysis or approximation theory principles to support this approximation technique in $\textit{rigorous math}$ would be appreciated.

Best Answer

Here is a fully detailed proof based on your given lemma, which I restate below:

(Lemma): Let $g \in L^2\left(I_n\right)$ be such that $\int_{\mathcal{H}} g(x)\, dx = 0$ for every half-space $\mathcal{H} := \{x : w^T x + \theta > 0\} \cap I_n$. Then $g = 0$ almost everywhere.

You didn't define it, but assuming the standard machine-learning setup, we define the family of one-hidden-layer perceptrons as $$\mathbf F :=\Big\{ f :\mathbb R^n\to\mathbb R,\ x\mapsto \sum_{i=1}^{N} \alpha_i\,\sigma(w_i^Tx +\theta_i)\ \Big|\ N\in\mathbb N,\ \alpha_i,\theta_i\in\mathbb R,\ w_i\in\mathbb R^n\Big\},\tag1$$ where $\sigma \equiv \mathbf 1\{\cdot\ge0\}$ is the Heaviside step function. Note that $\mathbf F$ is a linear subspace of $L^2(I_n)$ (it is closed under sums and scalar multiples); this is essential for the Hilbert-space argument below.

The goal is to show that $\mathbf F$ is dense in $L^2(I_n)$ (you didn't define it either, but I will assume that $I_n := [0,1]^n$ denotes the unit hypercube and identify elements of $\mathbf F$ with their restriction to $I_n$).

For the sake of contradiction, assume that $\overline{\mathbf F}$, the $L^2(I_n)$-closure of $\mathbf F$, is not equal to $L^2(I_n)$. Since $\mathbf F$ is a linear subspace, $\overline{\mathbf F}$ is a closed subspace, so the projection theorem gives the orthogonal decomposition $L^2(I_n) = \overline{\mathbf F} \oplus \overline{\mathbf F}^{\perp}$; as $\overline{\mathbf F}$ is proper, there exists a non-zero $g\in \overline{\mathbf F}^{\perp}$, i.e. $$ \int_{I_n} g(x) f(x)\ dx = 0,\quad \forall f\in\mathbf F. \tag2 $$ (Again, I am assuming that $L^2$ comes with the standard inner product.)

Since $\mathbf F$ contains in particular the single-neuron functions $x\mapsto \alpha\,\sigma(w^Tx+\theta)$ (take $N=1$ in $(1)$), identity $(2)$ implies $$\int_{I_n} \alpha\, g(x)\,\sigma(w^Tx +\theta)\ dx = 0,\quad \forall \alpha,\theta\in\mathbb R,\ w\in\mathbb R^n.\tag3 $$

Plugging in the definition of the Heaviside step function, we have for any $\alpha$, $\theta$, and $w\neq 0$: \begin{align}\int_{I_n} \alpha g(x)\sigma(w^Tx +\theta)\ dx &\stackrel{(a)}{=}\int_{I_n\cap\{w^Tx+\theta\ge 0\}} \alpha g(x)\sigma(w^Tx +\theta)\ dx\\ &+ \int_{I_n\cap\{w^Tx+\theta< 0\}} \alpha g(x)\sigma(w^Tx +\theta)\ dx\\ &\stackrel{(b)}{=}\int_{I_n\cap\{w^Tx+\theta\ge 0\}} \alpha g(x)\ dx\\ &\stackrel{(c)}{=}\int_{I_n\cap\{w^Tx+\theta> 0\}} \alpha g(x)\ dx, \end{align} where in $(a)$ we used the additivity of the Lebesgue integral, in $(b)$ the definition of the Heaviside step function, and in $(c)$ the fact that the hyperplane $\{w^Tx + \theta = 0\}$ has zero Lebesgue measure when $w\neq 0$. (The degenerate case $w = 0$ yields either the empty set or all of $I_n$ as the half-space; the former is trivial, and the latter is covered by taking $w\neq 0$ and $\theta$ large, as noted in the question.)

Finally, applying $(3)$ with $\alpha \equiv 1$, we find that our non-zero function $g$ satisfies $$\int_{I_n\cap\{w^Tx+\theta> 0\}} g(x)\ dx = 0,\quad \forall \theta\in\mathbb R,\ w\in\mathbb R^n.$$ By (Lemma), such a function is necessarily equal to zero almost everywhere, contradicting $g\neq 0$, and so the proof is complete.
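As an informal sanity check on the density claim (this is my own numerical sketch, not part of the proof; the sampling distributions for $w_i$ and $\theta_i$ are arbitrary illustrative choices), one can fix random Heaviside hidden units, fit only the output weights $\alpha$ by least squares, and watch the empirical $L^2$ error shrink as the number of hidden units grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def heaviside(z):
    return (z >= 0).astype(float)

def fit_and_error(target, N, m=10_000, n=2):
    """Monte-Carlo least-squares fit of the output weights alpha for a
    one-hidden-layer Heaviside network with N random hidden units;
    returns an estimate of the L2(I_n) approximation error."""
    X = rng.random((m, n))              # sample points in I_n = [0, 1]^n
    W = rng.normal(size=(N, n))         # random hidden weights w_i (arbitrary choice)
    theta = rng.uniform(-n, n, size=N)  # random biases theta_i (arbitrary choice)
    Phi = heaviside(X @ W.T + theta)    # (m, N) matrix of half-space indicators
    y = target(X)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.sqrt(np.mean((Phi @ alpha - y) ** 2))

target = lambda X: np.sin(2 * np.pi * X[:, 0]) * X[:, 1]
for N in (10, 100, 1000):
    print(f"N = {N:5d} hidden units, estimated L2 error ~ {fit_and_error(target, N):.4f}")
```

The decreasing errors are of course only evidence, not a proof; the proof above is what guarantees that the error can be driven to zero.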
