I am currently working on a research project investigating how the depth of a Neural Network affects its capacity to learn. To do this, I want to test different networks that have the same number of parameters but different depths and widths. The problem reduces to a system of equations over the integers. However, I haven't found a way to solve it, even with numerical methods, so I wanted to ask if someone could help me find a solution.
Context
A sequential Neural Network is defined by $n$ matrices corresponding to the weights of its layers ($\{W_i\}_{i = 1,\dots, n}$). The shape of each matrix is tied to that of the previous one (since the layers are composed, after applying a non-linear function) such that:
$$
W_i \in \mathbb{R}^{a\times b} \Leftrightarrow W_{i+1} \in \mathbb{R}^{b\times c}
$$
where $a, b, c \in \mathbb{N}$.
On the other hand, the number of parameters of a Neural Network equals the sum of the number of entries of all the matrices $W_i$. Assume that the number of neurons in the input layer and in the output layer stays constant (call them $a$ and $c \in \mathbb{N}$), and that every hidden layer of a given network has the same number of neurons (call it $b_i \in \mathbb{N}$ for network $i$). Then we can compute the number of parameters of each network. For example, for a network with one hidden layer, the number of parameters is:
$$
p_1 = ab_1 + b_1c
$$
Then, if we add another hidden layer, the number of parameters would be such that:
$$
p_2 = ab_2 + b_2b_2 + b_2c \\
= ab_2 + b_2^2 + b_2c
$$
Therefore, for the general case with $m+1$ hidden layers (so that there are $m$ square hidden-to-hidden matrices, consistent with the two cases above), the number of parameters is defined by the equation:
$$
p_m = ab_m + \sum_{i=1}^{m}{b_m^2} + b_mc \\
= ab_m + mb_m^2 + b_mc \\
= \boxed{b_m (a + mb_m + c)}
$$
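As a quick sanity check on the boxed formula, a direct count over the weight matrices should agree with it. A small Haskell sketch (the function names are mine, not part of any library):

```haskell
-- Direct parameter count for an MLP with the given layer widths
-- [input, hidden_1, ..., hidden_h, output]: one weight matrix per
-- consecutive pair of layers, no biases.
paramCount :: [Int] -> Int
paramCount widths = sum (zipWith (*) widths (tail widths))

-- Boxed closed form b (a + m b + c): m square b-by-b matrices,
-- i.e. m+1 hidden layers of common width b.
closedForm :: Int -> Int -> Int -> Int -> Int
closedForm a c m b = b * (a + m * b + c)

main :: IO ()
main = print (paramCount (4 : replicate 3 10 ++ [3]), closedForm 4 3 2 10)
```

Both counts agree, e.g. $4\cdot 10 + 10^2 + 10^2 + 10\cdot 3 = 270 = 10\,(4 + 2\cdot 10 + 3)$ for three hidden layers of width 10.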
Question
In my case, since I want to create $m$ neural networks ($5 \leq m \leq 12$ would be enough for the purposes of my research), I need to find the $b_i$'s $\in \mathbb{N}$ such that all $m$ networks have the same number of parameters $P$, where $P$ can be any positive integer (hopefully the smallest one for which the system has solutions):
$$
\begin{cases}
P = ab_1 + b_1^2 + cb_1\\
P = ab_2 + 2b_2^2 + cb_2 \\
\dots \\
P = ab_m + mb_m^2 + cb_m \\
\end{cases}
$$
$\Leftrightarrow$
$$
\begin{cases}
P = b_1 (a + b_1 + c)\\
P = b_2 (a + 2b_2 + c) \\
\dots \\
P = b_m (a + mb_m + c) \\
\end{cases}
$$
Is there a closed-form solution for this system of equations in the positive integers? That is, can the set of $b_i$'s be expressed in terms of $a$, $c$, and $m$, with $b_i \in \mathbb{N}$ for all $i \in \{1, \dots, m\}$?
I would really appreciate any help. Thank you all in advance.
Edit: As @LutzLehmann pointed out, relaxing the constraint so that the parameter counts only need to lie within a band of 1% or 5% around $P$ would also count as a solution to the problem. As for allowing the hidden layers of a network to have different widths, I think that would still be useful for the project as long as the widths do not differ from each other to a great extent.
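For small $m$, the smallest exact $P$ can at least be searched for by brute force. A Haskell sketch along these lines (my own function names; note that the search loops forever if no exact $P$ exists for the given $a$, $c$, $m$):

```haskell
import Data.Maybe (listToMaybe)

-- For the network whose square term has coefficient i, find a width b
-- with b * (a + i*b + c) == p, if one exists.
exactWidth :: Int -> Int -> Int -> Int -> Maybe Int
exactWidth a c i p =
  listToMaybe
    [ b
    | b <- takeWhile (\b' -> b' * (a + i * b' + c) <= p) [1 ..]
    , b * (a + i * b + c) == p
    ]

-- Smallest P, together with the widths b_1 .. b_m, that solves all
-- m equations of the system exactly.
smallestP :: Int -> Int -> Int -> (Int, [Int])
smallestP a c m =
  head
    [ (p, bs)
    | p <- [1 ..]
    , Just bs <- [traverse (\i -> exactWidth a c i p) [1 .. m]]
    ]

main :: IO ()
main = print (smallestP 4 3 2)  -- prints (60,[5,4])
```

For instance, with $a = 4$, $c = 3$, and $m = 2$ the smallest exact solution is $P = 60$, attained by $b_1 = 5$ (since $5\cdot 12 = 60$) and $b_2 = 4$ (since $4 \cdot 15 = 60$).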
Best Answer
Assume all hidden layers have exactly the same size $b_m$. Applying the quadratic formula to your boxed equation, and keeping the positive root, one gets:
$$b_m \approx \hat{b_m} = \frac{\sqrt{(a + c)^2 + 4 m P} - (a + c)}{2 m}$$
where exact equality is unlikely. Pick $P$ and $m$, compute $\hat{b}_m$ as above, pick a nearby integer for $b_m$, then calculate
$$P' = b_m (a + m b_m + c).$$
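A worked instance of this recipe (my own choice of $m$, using the Iris configuration from the example below, $a = 4$, $c = 3$, $P = 1000$): for $m = 5$,
$$
\hat{b}_5 = \frac{\sqrt{(4+3)^2 + 4\cdot 5\cdot 1000} - (4+3)}{2\cdot 5} \approx 13.46
\quad\Rightarrow\quad
b_5 = 13, \qquad P' = 13\,(4 + 5\cdot 13 + 3) = 936,
$$
which is within $6.4\%$ of the target $P = 1000$.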
For example, classifying the famous Iris data set with $a = 4, c = 3, P = 1000$ I calculated the following table:
where the last column shows they are all within a ±8% interval around $P$ (i.e. within ±80 in the table).
With $a = 28^2 = 784, c = 10, P = 100000$ (for example, a MNIST digit image classifier) I get:
satisfying a ±1% tolerance (which would be ±1000 in the table).
I used this Haskell code:
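(The original listing did not survive in this copy; the sketch below reconstructs the recipe just described, with function names of my own, and prints $b_m$, $P'$, and the deviation $P' - P$ for the Iris settings $a = 4$, $c = 3$, $P = 1000$.)

```haskell
import Text.Printf (printf)

-- Positive root of P = b (a + m b + c), as a real number.
bHat :: Double -> Double -> Double -> Double -> Double
bHat a c m p = (sqrt ((a + c) ^ 2 + 4 * m * p) - (a + c)) / (2 * m)

-- Round b-hat to the nearest integer and recompute the actual
-- parameter count P' = b (a + m b + c).
row :: Int -> Int -> Int -> Int -> (Int, Int, Int)
row a c p m = (b, p', p' - p)
  where
    b  = round (bHat (fromIntegral a) (fromIntegral c)
                     (fromIntegral m) (fromIntegral p))
    p' = b * (a + m * b + c)

main :: IO ()
main = mapM_ line [1 .. 12]
  where
    line m = let (b, p', d) = row 4 3 1000 m
             in printf "m=%2d  b=%3d  P'=%5d  P'-P=%+4d\n" m b p' d
```

For example, the first row comes out as $b_1 = 28$, $P' = 28\cdot 35 = 980$, a deviation of $-20$ from $P = 1000$.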
Better tolerances could possibly be achieved for lower $P$ values by using both of the integers adjacent to $\hat{b}_m$, spread across the different hidden layers of a network.