Solved – Neural networks: Active input range of activation functions

neural networks

I'm playing with the Neural Network toolbox in MATLAB.
I've noticed that each activation function (a.k.a. transfer function) has two properties:

  • the output range, which, if I understand correctly, is the codomain of the function, and
  • the active input range, which I really don't understand

For instance:

  • tansig (hyperbolic tangent sigmoid) has an output range of [-1,1] and an active input range of [-2,2].
  • logsig (log-sigmoid) has an output range of [0,1] and an active input range of [-4,4].
  • purelin (linear) has both an output range and an active input range of [-inf,+inf].

I'm really confused…

So, can you tell me:

  • What is the active input range of an activation function?
  • How can I compute the active input range for a custom activation function?

Thank you so much for your time.

Best Answer

What seems to be meant by this definition is the range of inputs over which the activation function actually varies (i.e. where the activation is not saturated). For instance, for a tanh activation function, outside of $[-2,2]$ the activation function does not vary much, i.e. its gradient is almost zero.
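
To make this concrete, the derivative of $\tanh$ is $1 - \tanh^2(x)$, so at the edge of that interval

$$\tanh'(2) = 1 - \tanh^2(2) \approx 1 - 0.964^2 \approx 0.07,$$

and by $x = 4$ it has already dropped to roughly $0.0013$: the function is effectively flat there.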

To compute this window, you can, for instance, compute the derivative of the activation function, choose a threshold $\epsilon$ below which you consider the derivative to be "small" (this is somewhat arbitrary, but for an activation function taking values around $[-1,1]$, a derivative below $10^{-2}$ usually indicates relative flatness), and take the active input range to be the interval over which the derivative stays above $\epsilon$.
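
As a minimal sketch of that recipe (not the toolbox's own computation), here is one way to estimate the range numerically for a custom activation function; the threshold `epsilon` and the evaluation grid are arbitrary choices, so the window you get will not necessarily match the documented toolbox values exactly:

```matlab
% Numerically estimate the "active input range": the interval where the
% absolute derivative of the activation function exceeds a small threshold.
f       = @(x) tanh(x);              % custom activation (tansig is tanh)
epsilon = 1e-2;                      % arbitrary "flatness" threshold
x       = linspace(-10, 10, 10001);  % evaluation grid, assumed wide enough

dfdx   = gradient(f(x), x);          % numerical derivative
active = x(abs(dfdx) > epsilon);     % points where the function still varies

fprintf('Estimated active input range: [%.2f, %.2f]\n', min(active), max(active));
```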

This matters for several reasons:

  • Since neural nets are usually trained with backpropagation, examples whose activations fall in the saturated range (i.e. outside the "active input range") of a neuron have almost no effect on that neuron's parameters when the gradient is computed (the gradient is essentially zero).

  • If, say, your features take extremely large values and you initialize the network weights at large values, a tanh unit may be completely saturated for all examples at the start of training, and the network will not train at all (see the sketch after this list). So you must take this saturation range into account when 1) scaling inputs and 2) initializing weights.

  • Generally, activation functions that do not saturate much (ReLU, for instance) result in much faster and more efficient training than saturating functions (sigmoid, tanh), precisely for the reasons above: consistently significant gradients and no saturation.
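
To illustrate the scaling/initialization point from the list above, here is a toy MATLAB snippet (the magnitudes are made up for the example) showing a single tanh unit saturating on unscaled inputs and recovering useful gradients once the inputs are standardized:

```matlab
% Toy illustration: huge raw inputs push a tanh unit into saturation,
% so its gradient vanishes; standardizing the inputs fixes this.
x_raw = 1000 * randn(1, 1000);      % unscaled features with large magnitude
w     = 0.5;                        % weight of a single tanh unit

a    = tanh(w * x_raw);             % unit outputs
grad = 1 - a.^2;                    % tanh derivative at each example
fprintf('Saturated examples: %.1f%%, mean gradient: %.2e\n', ...
        100 * mean(abs(a) > 0.999), mean(grad));

x_std = (x_raw - mean(x_raw)) / std(x_raw);   % zero mean, unit variance
a2    = tanh(w * x_std);
fprintf('After scaling, mean gradient: %.2e\n', mean(1 - a2.^2));
```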