Solved – Effect of e when using the Sigmoid Function as an activation function

neural-networks, sigmoid-curve

I'm writing my own neural net from scratch (using Clojure).

For the activation function for the nodes, I'm using the Sigmoid Function.

I was messing around manually creating "trained" nets based on example logic gates from this page, and noticed that my AND gate wasn't giving back values as close to 0 and 1 as I'd like. When given 1,1, it returned a number around 0.6 instead of a hard 1, or at least 0.99999; 0,0 gave back a number around 0.3 instead of a tiny number near 0; and 0,1 and 1,0 gave back a number around 0.4. I realized I needed to make the graph of the Sigmoid function "steeper" to get more extreme values, so I started playing with the function on Wolfram Alpha.

I noticed that if I change the constant e to something larger, it creates a much steeper graph than the original kind of lazy curve. When I changed the activation function to the modified steeper version, I did in fact get back nicer values.
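To make that concrete, here is a small Clojure sketch of what I mean. The AND weights below are just the usual textbook single-neuron values (weights 1, 1 and bias -1.5), not necessarily the exact ones from that page:

```clojure
;; A single "AND" neuron with hand-picked weights, evaluated with the standard
;; sigmoid and with a base-c ("steeper") variant. The weights here are the
;; common textbook choice and only an assumption -- the page I used may have
;; had different values.

(defn sigmoid
  "Standard logistic sigmoid, base e."
  [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn scaled-sigmoid
  "Sigmoid with e replaced by the constant c: c^x / (1 + c^x)."
  [c x]
  (/ 1.0 (+ 1.0 (Math/pow c (- x)))))

(defn neuron
  "Weighted sum of the inputs plus bias, passed through the activation f."
  [f weights bias inputs]
  (f (+ bias (reduce + (map * weights inputs)))))

(def and-weights [1.0 1.0])
(def and-bias   -1.5)

;; Plain sigmoid: outputs sit well away from 0 and 1.
(neuron sigmoid and-weights and-bias [1 1])                      ;=> ~0.62
(neuron sigmoid and-weights and-bias [0 0])                      ;=> ~0.18

;; Base-20 sigmoid, same weights: noticeably more extreme outputs.
(neuron (partial scaled-sigmoid 20) and-weights and-bias [1 1])  ;=> ~0.82
(neuron (partial scaled-sigmoid 20) and-weights and-bias [0 0])  ;=> ~0.01
```

Same weights, same inputs; the only thing that changes is the base of the exponential.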

Now, for my question:

I decided to try training my net while using the steeper version, and found that you can get fairly significant training speedups by making the graph steeper, although you seem to get diminishing returns once you go from e to values larger than 20. Here are the errors after 1 million forward passes and backpropagation corrections of the network, for different "scaling" values, when training an AND gate:

  • e: 0.0010363531
  • 5: 0.0008133447
  • 20: 0.0005932654
  • 50: 0.0005182057
  • 100: 0.0004771130

It seems as though increasing the scale of the sigmoid function increases how fast the net is able to learn, even when using the same learning rate.
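For reference, the training setup is roughly the following (a minimal Clojure sketch, not my actual code): a single neuron trained on the AND truth table with plain gradient descent on squared error. The base c enters in two places, the forward pass and the ln(c) factor in the derivative used for the weight updates:

```clojure
;; Minimal sketch, not my actual code: one neuron, squared-error loss,
;; plain gradient descent on the AND truth table, with a base-c sigmoid.
;; The derivative of c^x / (1 + c^x) is ln(c) * out * (1 - out), so every
;; gradient picks up a factor of ln(c) compared to the standard sigmoid.

(def and-data [[[0 0] 0] [[0 1] 0] [[1 0] 0] [[1 1] 1]])

(defn scaled-sigmoid [c x]
  (/ 1.0 (+ 1.0 (Math/pow c (- x)))))

(defn train-and
  "Runs n-iters passes over the AND table with learning rate lr and sigmoid
   base c, returning the trained [w1 w2 bias]."
  [c lr n-iters]
  (loop [ws [0.0 0.0 0.0]
         i  0]
    (if (= i n-iters)
      ws
      (recur
       (reduce
        (fn [[w1 w2 b] [[x1 x2] target]]
          (let [z    (+ (* w1 x1) (* w2 x2) b)
                out  (scaled-sigmoid c z)
                ;; dE/dz for squared error: (out - target) * ln(c) * out * (1 - out)
                grad (* (- out target) (Math/log c) out (- 1.0 out))]
            [(- w1 (* lr grad x1))
             (- w2 (* lr grad x2))
             (- b  (* lr grad))]))
        ws
        and-data)
       (inc i)))))

;; e.g. compare (train-and Math/E 0.5 100000) with (train-and 20 0.5 100000)
```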

Is there a rule of thumb regarding what value to use to scale the Sigmoid function when using it as an activation function?

Best Answer

Essentially, what you are doing here is making the gradient larger. This is similar to using a higher learning rate (setting aside the changed distribution; see @Frobot's comment below). In other words, you are raising your learning rate through a workaround. You shouldn't: tune the learning rate directly instead of inventing your own activation function.

If you do want to manufacture your own activation function, make sure to test it on at least 25+ datasets; it might just happen to work out for the AND gate.

List of activation functions

Learning rate

Update 10/6/20

I received this comment recently:

This is not correct. Making the activation steeper completely changes the distribution of the outputs for that layer. It is not at all similar to increasing the learning rate. The same thing is often done in LSTM's to make the distribution of 1s and 0s in certain gates higher on the ends and lower in the middle. You had the right logic. You want the distribution of values in the middle (0.5) to be lower while increasing the distribution of 0s and 1s.

This comment is entirely true; I did not take the changed distribution into account. However, in addition to my answer above, I will show the line of thinking I had, for others to consider: the sigmoid function is given by

$$f(x)=\frac{e^x}{1+e^x}$$

with derivative $f'(x)$. Now let us imagine that we replace $e$ with $20$, so that our new custom activation function is given by

$$g(x)=\frac{20^x}{1+20^x}$$

We note that $\ln(20)\approx3$ and see that with $y\equiv3x$ we can write the above as

$$g(x)=\frac{e^y}{1+e^y}=f(y)$$

and using the chain rule we can now show that

$$\frac{dg}{dx}=\frac{dy}{dx}\,f'(y)=3\,f'(3x)$$

We have now shown that if we replace $e$ with a constant $c$, the derivative of our custom sigmoid function is $\ln(c)$ times the derivative of the original sigmoid (evaluated at $y=x\ln(c)$). However, the error that is propagated backwards is indeed different, since $f(x)$ itself has changed, and this has a different effect on the regression than simply raising the learning rate.
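For anyone who wants to check this numerically, here is a quick finite-difference sketch in Clojure (matching the language used in the question); the two quantities below agree up to the finite-difference error:

```clojure
;; Finite-difference check of the claim above: g'(x) = ln(c) * f'(x * ln(c)),
;; where f is the standard sigmoid and g the base-c version.

(defn sigmoid [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))
(defn scaled-sigmoid [c x] (/ 1.0 (+ 1.0 (Math/pow c (- x)))))

(defn numeric-deriv
  "Central finite-difference approximation of f'(x)."
  [f x]
  (let [h 1e-6]
    (/ (- (f (+ x h)) (f (- x h))) (* 2 h))))

(let [c 20.0
      x 0.3
      y (* x (Math/log c))]
  [(numeric-deriv (partial scaled-sigmoid c) x)   ; g'(x)
   (* (Math/log c) (numeric-deriv sigmoid y))])   ; ln(c) * f'(y)
;;=> both entries agree (about 0.62 for c = 20, x = 0.3)
```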