Logistic Regression – Why Choose Sigmoid Function Over Others

Tags: least squares, logistic, neural networks

Why is the de-facto standard sigmoid function, $\frac{1}{1+e^{-x}}$, so popular in (non-deep) neural networks and logistic regression?

Why don't we use one of the many other differentiable functions, some with faster computation or slower-decaying gradients (so the vanishing-gradient problem occurs less)? A few examples are listed on Wikipedia's page on sigmoid functions. One of my favorites, with a slowly decaying gradient and a cheap evaluation, is $\frac{x}{1+|x|}$.
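To make the "slower decay" point concrete, here is a small numerical sketch (my own illustration, assuming NumPy): the logistic sigmoid's derivative shrinks exponentially in $|x|$, while the derivative of $\frac{x}{1+|x|}$ shrinks only polynomially.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the logistic sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def softsign_grad(x):
    # derivative of x / (1 + |x|): 1 / (1 + |x|)^2
    return 1.0 / (1.0 + np.abs(x)) ** 2

for x in [1.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  softsign'={softsign_grad(x):.2e}")
```

At $x = 10$ the sigmoid's gradient is already around $4.5 \times 10^{-5}$, while the softsign's is still about $8 \times 10^{-3}$.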

EDIT

The question is different from Comprehensive list of activation functions in neural networks with pros/cons, as I'm only interested in the 'why', and only for the sigmoid.

Best Answer

Quoting myself from this answer to a different question:

In section 4.2 of Pattern Recognition and Machine Learning (Springer 2006), Bishop shows that the logit arises naturally as the form of the posterior probability distribution in a Bayesian treatment of two-class classification. He then goes on to show that the same holds for discretely distributed features, as well as for a subset of the exponential family of distributions. For multi-class classification the logit generalizes to the normalized exponential or softmax function.

This explains why the logistic sigmoid is used in logistic regression.
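In outline (a sketch of the argument in Bishop's section 4.2, not a full derivation): for two classes, the posterior can always be rewritten in sigmoid form,

$$
p(C_1 \mid \mathbf{x})
= \frac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_1)\,p(C_1) + p(\mathbf{x} \mid C_2)\,p(C_2)}
= \frac{1}{1 + e^{-a}} = \sigma(a),
\qquad
a = \ln \frac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_2)\,p(C_2)}.
$$

For Gaussian class-conditional densities with a shared covariance matrix (and for the other families Bishop considers), $a$ turns out to be linear in $\mathbf{x}$, so the posterior is exactly $\sigma(\mathbf{w}^\top \mathbf{x} + w_0)$, i.e. the logistic regression model.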

Regarding neural networks, this blog post explains how different nonlinearities used in neural networks, including the logit / softmax and the probit, can be given a statistical interpretation and thereby a motivation. The underlying idea is that a multi-layered neural network can be regarded as a hierarchy of generalized linear models; according to this, activation functions are link functions, which in turn correspond to different distributional assumptions.
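As a minimal sketch of that correspondence (my own illustration, assuming NumPy and a hypothetical toy dataset): a single sigmoid unit trained on the Bernoulli log-likelihood is exactly a GLM with a logit link, i.e. logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature, binary labels drawn from a true logistic model.
X = rng.normal(size=(200, 1))
true_w, true_b = 2.0, -0.5
y = rng.binomial(1, sigmoid(X @ np.array([true_w]) + true_b))

# A "single-neuron network" with a sigmoid activation, fitted by gradient
# ascent on the Bernoulli log-likelihood -- identical to logistic regression,
# where the sigmoid is the inverse of the logit link.
w, b = np.zeros(1), 0.0
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)             # Bernoulli mean via the inverse link
    grad_w = X.T @ (y - p) / len(y)    # gradient of the mean log-likelihood
    grad_b = np.mean(y - p)
    w += lr * grad_w
    b += lr * grad_b

print("estimated:", w[0], b, "  true:", true_w, true_b)
```

Swapping the sigmoid for a probit or another inverse link changes the assumed noise distribution, which is the statistical reading of "choosing an activation function".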
