Neural Networks: Different Depths and Widths But Same Number of Parameters

Tags: linear algebra, neural networks, nonlinear system, numerical methods, systems of equations

I am currently working on a research project investigating how the depth of a neural network affects its capacity to learn. To do this, I want to compare networks that have the same number of parameters but different depths and widths. The problem reduces to a system of equations that must have integer solutions, but so far I have not found a way to solve it, not even with numerical methods. I would therefore like to ask whether someone could help me find a solution.

Context

A sequential neural network is defined by $n$ matrices corresponding to the weights of its layers ($\{W_i\}_{i = 1,\dots, n}$). The shape of each matrix is tied to that of the previous one (since the layers are composed after applying a non-linear function), so that:
$$
W_i \in \mathbb{R}^{a\times b} \Leftrightarrow W_{i+1} \in \mathbb{R}^{b\times c}
$$

where $a, b, c \in \mathbb{N}$.

On the other hand, the number of parameters of a neural network equals the total number of entries of all the matrices $W_i$. Assume that the number of neurons in the input layer and in the output layer stays fixed (call them $a$ and $c \in \mathbb{N}$) and that every hidden layer has the same number of neurons (call it $b_i \in \mathbb{N}$ for a network with $i$ hidden layers). Then the number of parameters can be computed directly. For example, for a network with one hidden layer it is:
$$
p_1 = ab_1 + b_1c
$$

Then, if we add another hidden layer, the number of parameters would be such that:
$$
\begin{aligned}
p_2 &= ab_2 + b_2b_2 + b_2c \\
&= ab_2 + b_2^2 + b_2c
\end{aligned}
$$

Therefore, for the general case with $m$ hidden layers the number of parameters is defined by the equation:
$$
\begin{aligned}
p_m &= ab_m + \sum_{i=1}^{m}{b_m^2} + b_mc \\
&= ab_m + mb_m^2 + b_mc \\
&= \boxed{b_m (a + mb_m + c)}
\end{aligned}
$$
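For what it's worth, the boxed formula is easy to evaluate programmatically. Below is a minimal Haskell sketch; the function name and the example values in the comment are mine, purely for illustration:

-- Number of parameters of a network with m equally wide hidden layers of
-- width b, input width a and output width c, per the boxed formula above.
nParams :: Integer -> Integer -> Integer -> Integer -> Integer
nParams a c m b = b * (a + m * b + c)

-- e.g. nParams 4 3 2 10 == 10 * (4 + 2*10 + 3) == 270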

Question

In my case I want to create $m$ neural networks ($5 \leq m \leq 12$ would be enough for the purposes of my research), so I need to find values $b_i \in \mathbb{N}$ such that all $m$ networks have the same number of parameters $P$, where $P$ can be any positive integer (ideally the smallest one for which the system has solutions).
$$
\begin{cases}
P = ab_1 + b_1^2 + cb_1\\
P = ab_2 + 2b_2^2 + cb_2 \\
\dots \\
P = ab_m + mb_m^2 + cb_m \\
\end{cases}
$$

$\Leftrightarrow$
$$
\begin{cases}
P = b_1 (a + b_1 + c)\\
P = b_2 (a + 2b_2 + c) \\
\dots \\
P = b_m (a + mb_m + c) \\
\end{cases}
$$

Is there a closed-form solution for this system of equations in the positive integers? That is, can the $b_i$ be expressed in terms of $a$, $c$, and $m$, with $b_i \in \mathbb{N}$ for all $i \in \{1, \dots, m\}$?
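To make the requirement concrete, here is a minimal brute-force sketch in Haskell of what an exact solution would have to satisfy; the search bound maxB and the values passed in main are arbitrary assumptions on my part, not part of the problem:

import qualified Data.Set as Set

-- Parameter count of a network with i hidden layers of width b (boxed formula).
nParams :: Int -> Int -> Int -> Int -> Int
nParams a c i b = b * (a + i * b + c)

-- Smallest P that the depths 1..m can all reach exactly, trying widths 1..maxB.
-- maxB is an arbitrary search bound, chosen only for illustration.
smallestCommonP :: Int -> Int -> Int -> Int -> Maybe Int
smallestCommonP a c m maxB =
  Set.lookupMin (foldr1 Set.intersection reachable)
  where
    reachable = [ Set.fromList [ nParams a c i b | b <- [1 .. maxB] ]
                | i <- [1 .. m] ]

main :: IO ()
main = print (smallestCommonP 4 3 5 10000)  -- a, c, m, maxB chosen only as an example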

I would really appreciate any help. Thank you all in advance.

Edit: As @LutzLehmann pointed out, relaxing the constraint so that the parameter counts only need to lie within a band of 1% or 5% of $P$ would also count as a solution. Regarding allowing the hidden layers of a network to have different numbers of neurons, that would still be useful for the project as long as the layer widths do not differ from each other by much.

Best Answer

Assume all hidden layers have exactly the same size $b_m$. Your boxed equation $P = b_m (a + m b_m + c)$ rearranges to $m b_m^2 + (a + c) b_m - P = 0$; applying the quadratic formula and keeping the positive root gives:

$$b_m \approx \hat{b_m} = \frac{\sqrt{(a + c)^2 + 4 m P} - (a + c)}{2 m}$$

where exact equality is unlikely. Pick $P$ and $m$, compute $\hat{b}_m$ as above and round it to a nearby integer $b_m$, then calculate

$$P' = b_m (a + m b_m + c).$$

For example, classifying the famous Iris data set with $a = 4, c = 3, P = 1000$ I calculated the following table:

 m   b_m    P'    dev (‱)
 5    15   1005      50
 6    13    936    -640
 7    12    948    -520
 8    11    924    -760
 9    11   1045     450
10    10    970    -300
11    10   1070     700
12     9    954    -460

where the last column gives the deviation of $P'$ from $P$ in parts per ten thousand; all values lie within a ±8% interval (±800 in the table).

With $a = 28^2 = 784, c = 10, P = 100000$ (for example, a MNIST digit image classifier) I get:

 m   b_m      P'    dev (‱)
 5    83   100347      35
 6    79   100172      17
 7    76   100776      78
 8    73   100594      59
 9    70    99680     -32
10    68   100232      23
11    66   100320      32
12    64    99968      -3

satisfying a ±1% tolerance (which would be ±100 in the table).

I used this Haskell code:

import Data.List (intercalate)

-- example values from the first table above (Iris): a = 4, c = 3, P = 1000
p, a, c :: Double
p = 1000
a = 4
c = 3

-- rounded positive root of  m*b^2 + (a + c)*b - P = 0
b m = fromInteger . round $
  (sqrt ((a + c)^2 + 4 * m * p) - (a + c)) / (2 * m)

-- actual parameter count for the rounded width
p' m = b m * (a + m * b m + c)

-- one table row per m; the last column is the deviation from P in parts per 10000
main = mapM_ putStrLn
  [ intercalate "|"
  . ([""] ++) . (++ [""])
  . map (show . round)
  $ [ m, b m, p' m, 10000 * (p' m - p) / p ]
  | m <- [5 .. 12]
  ]

Possibly better tolerances could be achieved for lower values of $P$ by mixing the two integers nearest to $\hat{b}_m$ across the different hidden layers; a sketch of that idea follows.
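Here is a minimal Haskell sketch of that idea (my own illustration, not used for the tables above): for each depth $m$, enumerate all $2^m$ assignments of $\lfloor \hat{b}_m \rfloor$ and $\lceil \hat{b}_m \rceil$ to the $m$ hidden layers, count parameters by summing the entries of the weight matrices as in the question's Context section, and keep the assignment closest to $P$. The names and the values in main are assumptions for illustration only:

import Data.List (minimumBy)
import Data.Ord (comparing)

-- Sum of weight-matrix entries for hidden widths ws = [w1, ..., wm]:
--   a*w1 + w1*w2 + ... + w_{m-1}*w_m + wm*c
mixedCount :: Int -> Int -> [Int] -> Int
mixedCount a c ws = sum (zipWith (*) (a : ws) (ws ++ [c]))

-- Try every assignment of the two integers nearest the estimate above to the
-- m hidden layers (2^m candidates, fine for m <= 12); keep the count closest to p.
bestMix :: Int -> Int -> Int -> Int -> ([Int], Int)
bestMix a c p m =
  minimumBy (comparing (\(_, q) -> abs (q - p)))
    [ (ws, mixedCount a c ws) | ws <- mapM (const [lo, hi]) [1 .. m] ]
  where
    bHat = ( sqrt (fromIntegral ((a + c) ^ 2 + 4 * m * p))
             - fromIntegral (a + c) ) / (2 * fromIntegral m) :: Double
    lo = floor bHat
    hi = ceiling bHat

main :: IO ()
main = print (bestMix 4 3 1000 7)   -- Iris-sized input/output, depth 7, as an example

Running bestMix for each m in the 5..12 range would then show whether mixing the two widths brings the parameter counts closer to a common $P$ than a single rounded width does.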
