The cost function used with the sigmoid was motivated by maximum likelihood estimation, and $$\text{cost}=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$ is just a compact way of writing $$\text{cost}=-\log(h_\theta(x))$$ when $y=1$ and $$\text{cost}=-\log(1-h_\theta(x))$$ when $y=0$.
That motivation still holds no matter what the activation function is (sigmoid or hyperbolic tangent). I would map the hyperbolic tangent from its range $(-1,1)$ onto the sigmoid's range $(0,1)$, so that:
$$\text{cost} = −\frac{y+1}{2} \log{\left(\frac{h_\theta(x)+1}{2}\right)}−(1− \frac{y+1}{2})\log\left(1−\frac{h_\theta(x)+1}{2}\right)$$
where $$h_\theta(x) = \tanh\left(\frac{2}{3}x\right).$$
This will have a different gradient than the sigmoid. Good luck.
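For concreteness, here is a minimal R sketch of that rescaled cost (my own illustration; the helper names are made up, and I assume labels coded as $y \in \{-1,1\}$, which is what the $\frac{y+1}{2}$ rescaling suggests):
h <- function(x) tanh((2 / 3) * x)            # the hypothesis from above
tanh_cost <- function(x, y) {
  p <- (h(x) + 1) / 2                         # rescale the output from (-1, 1) to (0, 1)
  t <- (y + 1) / 2                            # rescale a {-1, 1} label to {0, 1}
  -t * log(p) - (1 - t) * log(1 - p)          # the usual log-loss on the rescaled values
}
tanh_cost(x = 1.5, y = 1)                     # small cost: prediction agrees with the label
tanh_cost(x = -1.5, y = 1)                    # large cost: prediction disagrees with the label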
When I plot using the following R code:
x  <- seq(from = -2, to = 2, by = 0.01)
# candidate "derivative": this plugs x, rather than the activation value itself,
# into the (2/3)/1.7159 * (1.7159 - f) * (1.7159 + f) form of the derivative
y  <- 0.666666667 / 1.7159 * (1.7159 - x) * (1.7159 + x)
# the scaled tanh activation used throughout this answer
y2 <- 1.7159 * tanh(0.66666667 * x)
plot(x, y2, col = "red")   # red: activation
points(x, y)               # black: the questionable derivative
I get the following plot:
One of these is a sigmoid (red); the other is not a great derivative (black). Notice the negative values: they define a radius of convergence that sends Newton's method off toward infinity.
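A quick check of where that black curve dips below zero (my own addition, reusing the x and y vectors from the code above):
any(y < 0)               # TRUE, and the derivative of an increasing activation should never be negative
min(x[x > 0 & y < 0])    # 1.72, i.e. just past 1.7159: the curve turns negative for |x| > 1.7159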
Now using this R code:
x  <- seq(from = -2, to = 2, by = 0.01)
# analytic derivative of 1.7159 * tanh(2x/3): 1.7159 * (2/3) * sech(2x/3)^2, with 1.7159 * 2/3 = 1.14393
y  <- 1.14393 * (1 / cosh(2 * x / 3))^2
# the scaled tanh activation again
y2 <- 1.7159 * tanh(0.66666667 * x)
plot(x, y2, col = "red", type = "b")   # red: activation
points(x, y)                           # black: its derivative
I get this plot:
It is a more plausible graph of the derivative (black) for the sigmoid (red).
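As a sanity check (my own note), the constant 1.14393 used above is just the chain-rule factor from differentiating $1.7159\tanh\left(\frac{2}{3}x\right)$:
1.7159 * 2 / 3           # 1.143933..., matching the 1.14393 in the code above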
This was fun: link.
Edit:
Here are some basics on Tanh and friends.
- http://mathworld.wolfram.com/HyperbolicTangent.html
- http://mathworld.wolfram.com/HyperbolicCosine.html
- http://mathworld.wolfram.com/HyperbolicSine.html
Please notice in the first link that the derivative of the hyperbolic tangent is pow(hyperbolic_secant, 2), not pow(hyperbolic_cosine, 2).
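A quick numerical check in R (my own addition, using a central difference with step $10^{-5}$):
x0 <- 0.7
(tanh(x0 + 1e-5) - tanh(x0 - 1e-5)) / 2e-5   # ~0.6347, the numerical derivative
1 / cosh(x0)^2                               # ~0.6347, i.e. sech(x0)^2 -- matches
cosh(x0)^2                                   # ~1.575 -- does not match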
Best Answer
What seems to be meant by this definition is the range within which there is variation in the activation function (i.e. where the activation is not saturated). For instance, for a tanh activation function, outside of $[-2,2]$ the activation function does not vary much, i.e. its gradient is almost zero.
To compute this window, you can for instance compute the derivative of the activation function and choose a threshold $\epsilon$ below which you consider the derivative "small" (this is somewhat arbitrary, but for an activation function taking values around $-1,1$, derivative values below $10^{-2}$ usually indicate relative "flatness").
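A minimal R sketch of that procedure (my own illustration; the plain $\tanh$ activation, the grid, and the threshold $10^{-2}$ are assumptions):
z   <- seq(from = -6, to = 6, by = 0.001)
d   <- 1 / cosh(z)^2      # derivative of tanh(z), i.e. sech(z)^2
eps <- 1e-2
range(z[d > eps])         # approximate "active input range"; with this eps it comes out near [-3, 3]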
This matters for several reasons:
- Since neural nets are usually trained using backpropagation, examples whose activations fall in the saturated range (i.e. outside the "active input range") of a neuron have no effect on that neuron's parameters when computing the gradient (the gradient is essentially zero).
- If, say, your features take extremely large values and you initialize the network weights at large values, a tanh unit may be completely saturated for every example at the start of training, and the network will not train at all (see the sketch after this list). So you must take this saturation range into account when 1) scaling inputs and 2) initializing weights.
- Generally, activation functions which do not saturate much (ReLU, for instance) give much faster and more efficient training than saturating functions (sigmoid, tanh), precisely for the reasons above: consistently significant gradients and no saturation.
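A small R sketch of the second point (my own illustration; the feature scale, the weight scale, and the sizes are made-up numbers):
set.seed(1)
X <- matrix(rnorm(100 * 3, mean = 0, sd = 50), ncol = 3)   # unscaled, large-valued features
w <- rnorm(3, sd = 5)                                      # large initial weights
a <- tanh(X %*% w)                                         # activations: essentially all +1 or -1
g <- 1 - a^2                                               # derivative of tanh at those activations
summary(as.vector(g))                                      # gradients are ~0, so backprop barely updates w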