The cost function used with the sigmoid was motivated by maximum likelihood estimation, and $$\text{cost}=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$ is just a compact way of writing $$\text{cost}=-\log(h_\theta(x))$$ when $y=1$ and $$\text{cost}=-\log(1-h_\theta(x))$$ when $y=0$.
That motivation still holds no matter what the activation function is (sigmoid or hyperbolic tangent). I would map the hyperbolic tangent from its range $(-1,1)$ onto the sigmoid's range $(0,1)$, so that:
$$\text{cost} = −\frac{y+1}{2} \log{\left(\frac{h_\theta(x)+1}{2}\right)}−(1− \frac{y+1}{2})\log\left(1−\frac{h_\theta(x)+1}{2}\right)$$
where $$h_\theta(x) = \tanh\left(\frac{2}{3}x\right).$$
This will have a different gradient than the sigmoid. Good luck.
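For concreteness, here is a minimal R sketch of that rescaled cost (my own illustration; the helper names are made up, and I assume labels coded as $y \in \{-1,1\}$, which is what the $\frac{y+1}{2}$ rescaling suggests):
h <- function(x) tanh((2 / 3) * x)            # the hypothesis from above
tanh_cost <- function(x, y) {
  p <- (h(x) + 1) / 2                         # rescale the output from (-1, 1) to (0, 1)
  t <- (y + 1) / 2                            # rescale a {-1, 1} label to {0, 1}
  -t * log(p) - (1 - t) * log(1 - p)          # the usual log-loss on the rescaled values
}
tanh_cost(x = 1.5, y = 1)                     # small cost: prediction agrees with the label
tanh_cost(x = -1.5, y = 1)                    # large cost: prediction disagrees with the label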
When I plot using the following R code:
x  <- seq(from = -2, to = 2, by = 0.01)
# candidate "derivative": this plugs x, rather than the activation value itself,
# into the (2/3)/1.7159 * (1.7159 - f) * (1.7159 + f) form of the derivative
y  <- 0.666666667 / 1.7159 * (1.7159 - x) * (1.7159 + x)
# the scaled tanh activation used throughout this answer
y2 <- 1.7159 * tanh(0.66666667 * x)
plot(x, y2, col = "red")   # red: activation
points(x, y)               # black: the questionable derivative
I get the following plot:
One of these is a sigmoid (red); the other is not a great derivative (black). Notice the negative values: they define a radius of convergence that sends Newton's method off toward infinity.
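A quick check of where that black curve dips below zero (my own addition, reusing the x and y vectors from the code above):
any(y < 0)               # TRUE, and the derivative of an increasing activation should never be negative
min(x[x > 0 & y < 0])    # 1.72, i.e. just past 1.7159: the curve turns negative for |x| > 1.7159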
Now using this R code:
x  <- seq(from = -2, to = 2, by = 0.01)
# analytic derivative of 1.7159 * tanh(2x/3): 1.7159 * (2/3) * sech(2x/3)^2, with 1.7159 * 2/3 = 1.14393
y  <- 1.14393 * (1 / cosh(2 * x / 3))^2
# the scaled tanh activation again
y2 <- 1.7159 * tanh(0.66666667 * x)
plot(x, y2, col = "red", type = "b")   # red: activation
points(x, y)                           # black: its derivative
I get this plot:
It is a more plausible graph of the derivative (black) for the sigmoid (red).
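As a sanity check (my own note), the constant 1.14393 used above is just the chain-rule factor from differentiating $1.7159\tanh\left(\frac{2}{3}x\right)$:
1.7159 * 2 / 3           # 1.143933..., matching the 1.14393 in the code above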
This was fun: link.
Edit:
Here are some basics on Tanh and friends.
- http://mathworld.wolfram.com/HyperbolicTangent.html
- http://mathworld.wolfram.com/HyperbolicCosine.html
- http://mathworld.wolfram.com/HyperbolicSine.html
Please notice in the first link that the derivative of the hyperbolic tangent is pow(hyperbolic_secant, 2), not pow(hyperbolic_cosine, 2).
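A quick numerical check in R (my own addition, using a central difference with step $10^{-5}$):
x0 <- 0.7
(tanh(x0 + 1e-5) - tanh(x0 - 1e-5)) / 2e-5   # ~0.6347, the numerical derivative
1 / cosh(x0)^2                               # ~0.6347, i.e. sech(x0)^2 -- matches
cosh(x0)^2                                   # ~1.575 -- does not match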
Best Answer
What seems to be meant by this definition is the range within which there is variation in the activation function (i.e. where the activation is not saturated). For instance, for a tanh activation function, outside of $[-2,2]$ the activation function does not vary much, i.e. its gradient is almost zero.
To compute this window, you can for instance compute the derivative of the activation function and choose a threshold $\epsilon$ below which you consider the derivative "small" (this is somewhat arbitrary, but for an activation function taking values around $-1,1$, derivative values below $10^{-2}$ usually indicate relative "flatness").
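A minimal R sketch of that procedure (my own illustration; the plain $\tanh$ activation, the grid, and the threshold $10^{-2}$ are assumptions):
z   <- seq(from = -6, to = 6, by = 0.001)
d   <- 1 / cosh(z)^2      # derivative of tanh(z), i.e. sech(z)^2
eps <- 1e-2
range(z[d > eps])         # approximate "active input range"; with this eps it comes out near [-3, 3]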
This matters for several reasons:
- Since neural nets are usually trained using backpropagation, examples whose activations fall in the saturated range (i.e. outside the "active input range") of a neuron have no effect on that neuron's parameters when computing the gradient (the gradient is essentially zero).
- If, say, your features take extremely large values and you initialize the network weights at large values, a tanh unit may be completely saturated for every example at the start of training, and the network will not train at all (see the sketch after this list). So you must take this saturation range into account when 1) scaling inputs and 2) initializing weights.
- Generally, activation functions which do not saturate much (ReLU, for instance) give much faster and more efficient training than saturating functions (sigmoid, tanh), precisely for the reasons above: consistently significant gradients and no saturation.
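A small R sketch of the second point (my own illustration; the feature scale, the weight scale, and the sizes are made-up numbers):
set.seed(1)
X <- matrix(rnorm(100 * 3, mean = 0, sd = 50), ncol = 3)   # unscaled, large-valued features
w <- rnorm(3, sd = 5)                                      # large initial weights
a <- tanh(X %*% w)                                         # activations: essentially all +1 or -1
g <- 1 - a^2                                               # derivative of tanh at those activations
summary(as.vector(g))                                      # gradients are ~0, so backprop barely updates w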