[Math] How to Find the Maximum of a Function Represented by a Back-Propagation Neural Network

machine-learning, neural-networks, nonlinear-optimization, optimization

First, I train a standard feed-forward neural network over a training set of data points. I get an approximate function, say $F(x)$, represented implicitly by that neural network.

1. How do I find its (global) maximum?

Note that the function $F(x)$ is not given explicitly; it is only implicitly realized by the neural network.

$F(x)$ should be differentiable, and gradient descent may be applied to it. But I'm not sure how this could be done given only the neural network.

2. Could the structure of the neural network be exploited so that I could find the maxima faster?

PS: the question is not the same as finding the maximum value over a set of data points. I need the neural network for other purposes, so the network must be trained first and the maximum found afterwards.
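To make the question concrete, here is a minimal sketch of "applying gradient ascent to $F(x)$" when the network is treated as a pure black box, using finite differences for the gradient. The function `F` below is a hypothetical stand-in; in practice it would be the trained network's forward pass.

```python
import numpy as np

# Hypothetical stand-in for the trained network's F(x); any callable works.
# This one has its (unique) maximum at x = [1, 1].
def F(x):
    return -np.sum((x - 1.0) ** 2)

def finite_diff_grad(F, x, eps=1e-5):
    """Approximate dF/dx with central differences, treating F as a black box."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (F(x + e) - F(x - e)) / (2 * eps)
    return g

# Plain gradient *ascent* on F, starting from the origin.
x = np.zeros(2)
for _ in range(200):
    x += 0.1 * finite_diff_grad(F, x)

print(x)  # converges toward the maximizer [1, 1]
```

This works without touching the network's internals, but each gradient estimate costs $2n$ forward passes for an $n$-dimensional input, which is exactly the cost the back-propagation approach in the answer below avoids.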

Best Answer

The only way I can think of doing this is to use a method similar to back propagation.

A traditional FFNN has $L$ layers, where each layer $l$ has a vector of activations $a^{(l)} = \{a^{(l)}_1, a^{(l)}_2, \dots, a^{(l)}_n\}$ and a matrix of weights $W^{(l)}$, with $W^{(l)}_{jk}$ the weight connecting the $j$-th node of the previous layer to the $k$-th node of the current layer. Each node has a pre-activation $z^{(l)}_k = \sum_j W^{(l)}_{jk} a^{(l-1)}_j$ and an activation function $f^{(l)}$, so that $a^{(l)}_k = f^{(l)}(z^{(l)}_k)$. What you are looking for is the gradient of the activation of the output layer $\nabla f^{(L)}$ (in this case just one node) with respect to the input of the first layer. Using the chain rule, $$ \nabla \big(f^{(l)}(z^{(l)}_k)\big) = \nabla f^{(l)}(z^{(l)}_k)\,\nabla z^{(l)}_k = \nabla f^{(l)}(z^{(l)}_k) \sum_j W^{(l)}_{jk}\,\nabla\big(f^{(l - 1)}(z^{(l - 1)}_j)\big) $$ where all gradients are taken with respect to the network's input.

Using this equation you can calculate the gradient in terms of the input space through a process similar to back propagation where instead of propagating the gradient of the loss function, you are propagating the gradient of the activation function.

$\nabla f^{(l)}$ can be calculated analytically from the activation function, and $z^{(l)}_k$ is already computed during the forward pass. By propagating backwards layer by layer you will eventually obtain $\nabla_{a^{(1)}} f^{(L)}$, the gradient of the output with respect to the input.
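The backward pass described above can be sketched for a tiny network. The weights below are random stand-ins for a trained network (tanh hidden layer, linear output), and the analytic input gradient is checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-3-1 network: tanh hidden layer, linear output.
# Random weights stand in for a trained network.
W1 = rng.normal(size=(2, 3))
b1 = rng.normal(size=3)
W2 = rng.normal(size=(3, 1))
b2 = rng.normal(size=1)

def forward(x):
    z1 = x @ W1 + b1
    a1 = np.tanh(z1)
    z2 = a1 @ W2 + b2      # z2 is F(x) for a linear output node
    return z1, a1, z2

def input_gradient(x):
    """Propagate the output gradient back to the input, reusing z from the forward pass."""
    z1, a1, z2 = forward(x)
    g = np.ones(1)                   # dF/dz2 = 1 (linear output)
    g = W2 @ g                       # through the output weights: dF/da1
    g = (1 - np.tanh(z1) ** 2) * g   # through tanh: f'(z1), elementwise
    return W1 @ g                    # through the first weights: dF/dx

x = np.array([0.3, -0.7])
analytic = input_gradient(x)

# Sanity check against central finite differences.
eps = 1e-6
numeric = np.array([
    (forward(x + eps * np.eye(2)[i])[2] - forward(x - eps * np.eye(2)[i])[2]).item()
    / (2 * eps)
    for i in range(2)
])
print(np.allclose(analytic, numeric, atol=1e-5))
```

This is the same machinery modern autodiff frameworks expose directly (e.g. gradients with respect to inputs rather than weights), so in practice you rarely need to hand-code it.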

I'm not sure how to convert this to stochastic gradient descent, as there is no data set to sample from. Plain gradient ascent will likely get stuck in local optima. You could use it as a local search step inside a broader strategy such as basin hopping. Honestly, it may just be better to use black-box optimization methods (e.g. simulated annealing, genetic algorithms, particle swarm optimization) instead of gradient-based methods.
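As a minimal sketch of combining local gradient ascent with a global strategy, here is multi-start gradient ascent (random restarts, a simpler cousin of basin hopping) on a hypothetical multimodal function standing in for the network's $F(x)$; in practice $F$ and its gradient would come from the network's forward and backward passes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical multimodal objective standing in for the network's F(x);
# it has several local maxima on [-3, 3].
def F(x):
    return np.sin(3 * x) + 0.5 * x

def dF(x, eps=1e-6):
    # Central-difference gradient; a real network would supply this analytically.
    return (F(x + eps) - F(x - eps)) / (2 * eps)

def ascend(x, steps=500, lr=0.01):
    """Local gradient ascent from a single starting point."""
    for _ in range(steps):
        x += lr * dF(x)
    return x

# Multi-start: run local ascent from random initial points, keep the best.
starts = rng.uniform(-3, 3, size=20)
candidates = [ascend(float(x0)) for x0 in starts]
best = max(candidates, key=F)
print(best, F(best))
```

Each restart only finds the local maximum of its own basin; the outer loop is what gives you a chance at the global one. Basin hopping refines this idea by perturbing the current optimum instead of restarting uniformly at random.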
