Meaning of the Terms “Dense” and “Arbitrary”: Approximation Theory and Neural Networks

approximation, machine learning, neural networks

I was reading about the mathematical background behind Neural Networks to try and understand why a Neural Network is in theory able to "work". I came across the following link: https://en.wikipedia.org/wiki/Universal_approximation_theorem

If I understand correctly, the original Universal Approximation Theorem for Neural Networks first showed that a Neural Network with 2 layers and "arbitrary depth" (e.g. an infinite number of neurons) is able to approximate any function to an arbitrary level of precision.

However, the "issue" with this theorem was that a Neural Network with an "arbitrarily large number of neurons" might be able to approximate any function perfectly – but such a network would be computationally infeasible (i.e. "too deep"). Thus, extensive research was done to extend the "fruits" of the Universal Approximation Theorem (i.e. the ability of Neural Networks to approximate any function arbitrarily well) to Neural Networks of "non-arbitrary" (i.e. bounded) depth. In other words: how well can Neural Networks perform that have "many layers (instead of only two layers) but a non-arbitrary number of neurons (i.e. non-arbitrary depth)"?

I was reading the abstracts of these two papers:

I copied and pasted the abstracts below:

  • Abstract 1: The standard Universal Approximation Theorem for operator neural networks (NNs) holds for arbitrary width and bounded depth. Here, we prove that operator NNs of bounded width and arbitrary depth are universal approximators for continuous nonlinear operators. In our main result, we prove that for non-polynomial activation functions that are continuously differentiable at a point with a nonzero derivative, one can construct an operator NN of width five, whose inputs are real numbers with finite decimal representations, that is arbitrarily close to any given continuous nonlinear operator.

  • Abstract 2: The classical Universal Approximation Theorem holds for neural networks of arbitrary width and bounded depth. Here we consider the natural `dual' scenario for networks of bounded width and arbitrary depth. Precisely, let n be the number of inputs neurons, m be the number of output neurons, and let ρ be any nonaffine continuous function, with a continuous nonzero derivative at some point.

Based on the descriptions provided in these two abstracts – I think I am incorrectly understanding the Universal Approximation Theorem, or I am incorrectly understanding the definitions of the words "bounded" and "arbitrary". Both abstracts state that the standard/classical Universal Approximation Theorem holds for "arbitrary width and bounded depth".

However, I thought it was precisely the opposite – the classic/original/standard Universal Approximation Theorem was intended for "bounded width and arbitrary depth" (e.g. 2 layers and infinite Neurons), and significant research was done to extend the results of this theorem to prove the existence of a class of Neural Networks with similar approximation capabilities having "arbitrary width and bounded depth".

Isn't this exactly the opposite?

  • Can someone please help in clarifying the confusion I am having?

Thank you!

PS: I have always been interested in understanding the mathematical definition of the term "Dense":

[Screenshot of the theorem statement: the class "N" of Neural Networks is dense in the set "C" of functions mapping X to Y.]

Based on the definition of "Dense" provided here (https://en.wikipedia.org/wiki/Dense_set), the above statement seems to be suggesting that ALL Neural Networks belonging to "N" are "arbitrarily close" to ALL D-dimensional functions that map X to Y (i.e. the set "C").

  • In other words, does this mean that, if you have a point "A1" belonging to some function "F1" (and the function "F1" is in the set "C") – then in theory a Neural Network belonging to the class "N" would be able to provide an approximation to the point "A1" (call this approximation "A2"), and "A2" will be arbitrarily close to "A1"?

  • And of course, this statement regarding quality of approximation could be extended to all elements belonging to the same set as "A1" and all elements belonging to the same set as "A2"?

Thank you!

Best Answer

You do indeed have things backwards. Arbitrary width refers to a neural net with fixed input and output layers and an intermediate layer that can be made as large as necessary. Arbitrary depth refers to a neural net with a fixed upper bound on the number of neurons in each layer but arbitrarily many layers.
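To make the two shapes concrete, here is a minimal sketch in PyTorch (the layer sizes and activation choices below are my own illustrative picks, not values taken from any of the theorems):

```python
import torch.nn as nn

# "Arbitrary width": one hidden layer whose size we are free to grow.
def wide_shallow_net(n_inputs, m_outputs, hidden_width):
    return nn.Sequential(
        nn.Linear(n_inputs, hidden_width),  # the only hidden layer
        nn.Sigmoid(),
        nn.Linear(hidden_width, m_outputs),
    )

# "Arbitrary depth": every layer is capped at a fixed width,
# but we are free to stack as many layers as we like.
def narrow_deep_net(n_inputs, m_outputs, max_width, n_hidden_layers):
    layers = [nn.Linear(n_inputs, max_width), nn.ReLU()]
    for _ in range(n_hidden_layers - 1):
        layers += [nn.Linear(max_width, max_width), nn.ReLU()]
    layers.append(nn.Linear(max_width, m_outputs))
    return nn.Sequential(*layers)

# e.g. wide_shallow_net(2, 1, hidden_width=10_000)
#      narrow_deep_net(2, 1, max_width=6, n_hidden_layers=200)
```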

George Cybenko proved the first version of the arbitrary-width theorem for sigmoid activation functions in 1989 (see the Wikipedia page). The current version of the arbitrary-width theorem is valid for any non-polynomial activation function, and essentially states that any continuous function can be approximated to an arbitrary degree of precision by some neural net with a single, sufficiently large hidden layer (i.e. three layers in total).
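As a quick numerical illustration of the arbitrary-width statement (a random-feature least-squares fit I wrote for demonstration, not the construction used in the proofs), the maximum error of a single sigmoid hidden layer tends to shrink as the layer grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target function to approximate on [0, 1].
def f(x):
    return np.sin(2 * np.pi * x) + 0.5 * x

# A single hidden layer of width `width`:
#   f_hat(x) = sum_i c_i * sigmoid(w_i * x + b_i).
# The inner weights w_i, b_i are drawn at random and only the output
# weights c_i are fitted by least squares -- just to show the maximum
# error on a grid shrinking as the hidden layer grows.
x = np.linspace(0.0, 1.0, 500)[:, None]          # column of inputs
y = f(x[:, 0])
for width in (5, 50, 500):
    w = rng.normal(scale=20.0, size=(1, width))  # hidden weights
    b = rng.uniform(-10.0, 10.0, size=width)     # hidden biases
    H = sigmoid(x @ w + b)                       # hidden activations, shape (500, width)
    c, *_ = np.linalg.lstsq(H, y, rcond=None)    # output weights
    err = np.max(np.abs(H @ c - y))              # max error on the grid
    print(f"width={width:4d}  max error on grid = {err:.4f}")
```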

The major results for the arbitrary-depth case are much more recent, and seem to have been mostly proven in the late 2010s. There are a couple of variants of the arbitrary-depth case: one for the ReLU activation function (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) with a good bound on the width, and one for a more general activation function with a weaker bound on the width. They both say, essentially, that given input and output layers of fixed size, there is some bound $B$ on the width such that any continuous function can be approximated by a neural net with at most $B$ neurons in each layer and sufficiently many layers.
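A standard toy example of what bounded width plus growing depth buys you (again, an illustration rather than the construction from these papers) is the sawtooth function: composing a width-2 ReLU "hat" block with itself $k$ times produces a function with $2^k$ oscillations, so a network whose width never exceeds 2 can represent increasingly wiggly functions as it gets deeper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# One width-2 ReLU block computing the "hat" map on [0, 1]:
#   g(x) = 2*min(x, 1 - x) = 2*relu(x) - 4*relu(x - 0.5)
def hat(x):
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# Composing the block k times gives a sawtooth with 2**k teeth.
x = np.linspace(0.0, 1.0, 9)
y = x.copy()
for depth in range(1, 4):
    y = hat(y)
    print(f"depth={depth}:", np.round(y, 3))
```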

When one set, say $X$, is a dense subset of another, say $Y$, it means that any element of $Y$ can be approximated arbitrarily closely by some element of $X$, not that all elements of $X$ are close to all elements of $Y$. In our case, each of the universal approximation theorems states that a certain space of functions generated by neural nets, of either arbitrary width and bounded depth or bounded width and arbitrary depth, is dense in the space of continuous functions from, let's say, $\mathbb{R}^n$ to $\mathbb{R}^m$.
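Spelled out with quantifiers (writing $N$ for the set of functions computed by the nets, $C$ for the continuous functions on a compact domain $K$, and using the sup norm, which is the setting these theorems work in), density of $N$ in $C$ means:

$$\forall f \in C,\ \forall \varepsilon > 0,\ \exists g \in N \ \text{ such that } \ \sup_{x \in K} \| f(x) - g(x) \| < \varepsilon.$$

Note that the single approximating net $g$ is close to $f$ at every point $x$ simultaneously, which is stronger than being able to match one point "A1" at a time.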
