I'm writing my own neural net from scratch (using Clojure).
For the activation function for the nodes, I'm using the Sigmoid Function.
I was messing around manually creating "trained" nets based on example logic gates from this page, and noticed that my AND gate wasn't giving back values as clean as I'd like. When given `1,1`, it returned a number around 0.6 instead of a hard 1 (or at least 0.99999); `0,0` gave back a number around 0.3 instead of a tiny number near 0; and `0,1` and `1,0` each gave back a number around 0.4. I realized I needed to make the graph of the sigmoid function "steeper" to get more extreme values, so I started playing with the function on Wolfram Alpha. I noticed that if I change the constant $e$ to something larger, the curve becomes much steeper than the original lazy one. When I changed the activation function to the steeper version, I did in fact get back nicer values.
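To make the effect concrete, here is a small sketch (in Python rather than the Clojure of the original net) of a base-$c$ sigmoid applied to a hand-wired AND neuron; the weights ($w_1 = w_2 = 1$, bias $-1.5$) are hypothetical, chosen only so that the standard sigmoid produces the kind of lukewarm outputs described above:

```python
import math

def sigmoid(x, c=math.e):
    """Logistic curve with base c: c^x / (1 + c^x).  c = e gives the standard sigmoid."""
    return 1.0 / (1.0 + c ** (-x))

def and_gate(x1, x2, c=math.e):
    # Hypothetical hand-set weights: w1 = w2 = 1, bias = -1.5
    return sigmoid(1.0 * x1 + 1.0 * x2 - 1.5, c)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((a, b), round(and_gate(a, b), 3), round(and_gate(a, b, c=20.0), 3))
```

With the base raised to 20 the same weights push the outputs much closer to the 0/1 extremes, matching the "steeper graph" observation.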
Now, for my question:
I decided to try training my net with the steeper version, and found that you can get fairly significant training speedups by making the graph steeper, although you seem to hit diminishing returns once you change from $e$ to a value larger than 20. Here are the errors after 1 million fires and backpropagation corrections of the network, for different "scaling" values, when training an AND gate:
e: 0.00103635315
: 0.000813344720
: 0.000593265450
: 0.0005182057100
: 0.0004771130
It seems as though increasing the scale of the sigmoid function increases how fast the net is able to learn, even when using the same learning rate.
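The speedup can be reproduced on a toy scale. Below is a hedged sketch (Python, not the original Clojure; all names, the learning rate, and the epoch count are my own choices) of a single sigmoid neuron trained on AND with two bases, using the fact that the derivative of the base-$c$ sigmoid is $\ln(c)\,y\,(1-y)$:

```python
import math

def make_sigmoid(c):
    """Base-c sigmoid: c^x / (1 + c^x) == 1 / (1 + c^-x)."""
    k = math.log(c)
    return lambda x: 1.0 / (1.0 + math.exp(-k * x))

def train_and_gate(c, lr=0.5, epochs=5000):
    k = math.log(c)
    f = make_sigmoid(c)
    w1 = w2 = b = 0.0
    data = [((0, 0), 0.0), ((0, 1), 0.0), ((1, 0), 0.0), ((1, 1), 1.0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            y = f(w1 * x1 + w2 * x2 + b)
            # derivative of the base-c sigmoid w.r.t. its input is ln(c) * y * (1 - y)
            g = (y - t) * k * y * (1.0 - y)
            w1 -= lr * g * x1
            w2 -= lr * g * x2
            b  -= lr * g
    # mean squared error over the four input patterns after training
    return sum((f(w1 * x1 + w2 * x2 + b) - t) ** 2 for (x1, x2), t in data) / 4

err_e  = train_and_gate(math.e)
err_20 = train_and_gate(20.0)
print(err_e, err_20)  # at the same learning rate, base 20 ends with the lower error
```

This mirrors the trend in the table above: with an identical learning rate, the steeper base reaches a smaller error in the same number of updates.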
Is there a rule of thumb regarding what value to use to scale the Sigmoid function when using it as an activation function?
Best Answer
Essentially, what you are doing here is making the gradient larger. This is similar to using a higher learning rate (not taking into account the changed output distribution; see @Frobot's comment). You are increasing your learning rate through a workaround, but you shouldn't: tune the learning rate directly instead of inventing your own activation function.
If you do want to manufacture your own activation function, make sure to test it on at least 25 datasets. It might just happen to work out for the AND gate.
List of activation functions
Learning rate
Update 10/6/20
I received this comment recently:
This comment is entirely true: I did not take into account the changed distribution. However, in addition to my answer above, I will show the line of thinking I had, for others to consider. The sigmoid function is given by
$$f(x)=\frac{e^x}{1+e^x}$$
with derivative $f'(x)$. Now let us imagine that we replace $e$ with $20$, so that our new custom activation function is given by
$$g(x)=\frac{20^x}{1+20^x}$$
We note that $\ln(20)\approx3$, so with $y\equiv3x$ we can write the above as
$$g(x)=\frac{e^{y}}{1+e^{y}}=f(y)=f(3x)$$
Using the chain rule we can now show that
$$g'(x)=\frac{dy}{dx}\,f'(y)=3\,f'(3x)$$
We have now shown that if we replace $e$ with a constant $c$, the derivative of our custom sigmoid is $\ln(c)$ times the derivative of the original sigmoid, evaluated at the scaled input. However, the error that is propagated backwards is indeed different, since $f(x)$ itself has changed, and that has a separate effect on the regression.
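The $\ln(c)$ factor above can be checked numerically. A minimal sketch, comparing a finite-difference derivative of the base-20 sigmoid against $\ln(20)\,f'(x\ln 20)$ (the point $x=0.7$ and the step size are arbitrary choices of mine):

```python
import math

def f(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def g(x, c=20.0):
    """Base-c sigmoid; equals f(x * ln c)."""
    return c ** x / (1.0 + c ** x)

def num_deriv(fn, x, h=1e-6):
    """Central finite-difference approximation to fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

k = math.log(20.0)
x = 0.7
lhs = num_deriv(g, x)
rhs = k * num_deriv(f, k * x)  # chain rule: g'(x) = ln(c) * f'(x * ln(c))
print(abs(lhs - rhs))  # ≈ 0
```

The two quantities agree to well within finite-difference error, confirming that the custom base only rescales the gradient (and the input) rather than producing a fundamentally new activation.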