Logistic Regression – Why Choose Sigmoid Function Over Others

least squareslogisticneural networks

Why is the de-facto standard sigmoid function, $\frac{1}{1+e^{-x}}$, so popular in (non-deep) neural-networks and logistic regression?

Why don't we use many of the other derivable functions, with faster computation time or slower decay (so vanishing gradient occurs less). Few examples are on Wikipedia about sigmoid functions. One of my favorites with slow decay and fast calculation is $\frac{x}{1+|x|}$.

EDIT

The question is different to Comprehensive list of activation functions in neural networks with pros/cons as I'm only interested in the 'why' and only for the sigmoid.

Best Answer

Quoting myself from this answer to a different question:

In section 4.2 of Pattern Recognition and Machine Learning (Springer 2006), Bishop shows that the logit arises naturally as the form of the posterior probability distribution in a Bayesian treatment of two-class classification. He then goes on to show that the same holds for discretely distributed features, as well as a subset of the family of exponential distributions. For multi-class classification the logit generalizes to the normalized exponential or softmax function.

This explains why this sigmoid is used in logistic regression.

Regarding neural networks, this blog post explains how different nonlinearities including the logit / softmax and the probit used in neural networks can be given a statistical interpretation and thereby a motivation. The underlying idea is that a multi-layered neural network can be regarded as a hierarchy of generalized linear models; according to this, activation functions are link functions, which in turn correspond to different distributional assumptions.

Notation

This notation is from Neilsen's book.

A Feedforward Neural Network is a many layers of neurons connected together. It takes in an input, then that input "trickles" through the network and the neural network returns an output vector.

More formally, call $a^i_j$ the activation (aka output) of the $j^{th}$ neuron in the $i^{th}$ layer, where $a^1_j$ is the $j^{th}$ element in the input vector.

Then we can relate the next layer's input to it's previous via the following relation:

$$a^i_j = \sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$

where

$\sigma$ is the activation function,
$w^i_{jk}$ is the weight from the $k^{th}$ neuron in the $(i-1)^{th}$ layer to the $j^{th}$ neuron in the $i^{th}$ layer,
$b^i_j$ is the bias of the $j^{th}$ neuron in the $i^{th}$ layer, and
$a^i_j$ represents the activation value of the $j^{th}$ neuron in the $i^{th}$ layer.

Sometimes we write $z^i_j$ to represent $\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j$, in other words, the activation value of a neuron before applying the activation function.

enter image description here

For more concise notation we can write

$$a^i = \sigma(w^i \times a^{i-1} + b^i)$$

To use this formula to compute the output of a feedforward network for some input $I \in \mathbb{R}^n$, set $a^1 = I$, then compute $a^2, a^3, \ldots, a^m$, where $m$ is the number of layers.

Activation Functions

(in the following, we will write $\exp(x)$ instead of $e^x$ for readability)

Identity

Also known as a linear activation function.

$$a^i_j = \sigma(z^i_j) = z^i_j$$

Identity

Step

$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j > 0 \end{cases}$$

Step

Piecewise Linear

Choose some $x_{\min}$ and $x_{\max}$, which is our "range". Everything less than than this range will be 0, and everything greater than this range will be 1. Anything else is linearly-interpolated between. Formally:

$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < x_{\min} \\ m z^i_j+b & \text{if } x_{\min} \leq z^i_j \leq x_{\max} \\ 1 & \text{if } z^i_j > x_{\max} \end{cases}$$

Where

$$m = \frac{1}{x_{\max}-x_{\min}}$$

and

$$b = -m x_{\min} = 1 - m x_{\max}$$

Piecewise Linear

Sigmoid

$$a^i_j = \sigma(z^i_j) = \frac{1}{1+\exp(-z^i_j)}$$

Sigmoid

Complementary log-log

$$a^i_j = \sigma(z^i_j) = 1 − \exp\!\big(−\exp(z^i_j)\big)$$

Complementary log-log

Bipolar

$$a^i_j = \sigma(z^i_j) = \begin{cases} -1 & \text{if } z^i_j < 0 \\ \ \ \ 1 & \text{if } z^i_j > 0 \end{cases}$$

Bipolar

Bipolar Sigmoid

$$a^i_j = \sigma(z^i_j) = \frac{1-\exp(-z^i_j)}{1+\exp(-z^i_j)}$$ Bipolar Sigmoid

Tanh

$$a^i_j = \sigma(z^i_j) = \tanh(z^i_j)$$

Tanh

LeCun's Tanh

See Efficient Backprop. $$a^i_j = \sigma(z^i_j) = 1.7159 \tanh\!\left( \frac{2}{3} z^i_j\right)$$

LeCun's Tanh

Scaled:

LeCun's Tanh Scaled

Hard Tanh

$$a^i_j = \sigma(z^i_j) = \max\!\big(-1, \min(1, z^i_j)\big)$$

Hard Tanh

Absolute

$$a^i_j = \sigma(z^i_j) = \mid z^i_j \mid$$

Absolute

Rectifier

Also known as Rectified Linear Unit (ReLU), Max, or the Ramp Function.

$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)$$

Rectifier

Modifications of ReLU

These are some activation functions that I have been playing with that seem to have very good performance for MNIST for mysterious reasons.

$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\cos(z^i_j)$$

Scaled:

$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\sin(z^i_j)$$

Scaled:

Smooth Rectifier

Also known as Smooth Rectified Linear Unit, Smooth Max, or Soft plus

$$a^i_j = \sigma(z^i_j) = \log\!\big(1+\exp(z^i_j)\big)$$

Smooth Rectifier

Logit

$$a^i_j = \sigma(z^i_j) = \log\!\bigg(\frac{z^i_j}{(1 − z^i_j)}\bigg)$$

Logit

Scaled:

Logit Scaled

Probit

$$a^i_j = \sigma(z^i_j) = \sqrt{2}\,\text{erf}^{-1}(2z^i_j-1)$$.

Where $\text{erf}$ is the Error Function. It can't be described via elementary functions, but you can find ways of approximating it's inverse at that Wikipedia page and here.

Alternatively, it can be expressed as

$$a^i_j = \sigma(z^i_j) = \phi(z^i_j)$$.

Where $\phi $is the Cumulative distribution function (CDF). See here for means of approximating this.

Probit

Scaled:

Probit Scaled

Cosine

See Random Kitchen Sinks.

$$a^i_j = \sigma(z^i_j) = \cos(z^i_j)$$.

Cosine

Softmax

Also known as the Normalized Exponential. $$a^i_j = \frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}$$

This one is a little weird because the output of a single neuron is dependent on the other neurons in that layer. It also does get difficult to compute, as $z^i_j$ may be a very high value, in which case $\exp(z^i_j)$ will probably overflow. Likewise, if $z^i_j$ is a very low value, it will underflow and become $0$.

To combat this, we will instead compute $\log(a^i_j)$. This gives us:

$$\log(a^i_j) = \log\left(\frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}\right)$$

$$\log(a^i_j) = z^i_j - \log(\sum\limits_k \exp(z^i_k))$$

Here we need to use the log-sum-exp trick:

Let's say we are computing:

$$\log(e^2 + e^9 + e^{11} + e^{-7} + e^{-2} + e^5)$$

We will first sort our exponentials by magnitude for convenience:

$$\log(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7})$$

Then, since $e^{11}$ is our highest, we multiply by $\frac{e^{-11}}{e^{-11}}$:

$$\log(\frac{e^{-11}}{e^{-11}}(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7}))$$

$$\log(\frac{1}{e^{-11}}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$

$$\log(e^{11}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$

$$\log(e^{11}) + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$

$$ 11 + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$

We can then compute the expression on the right and take the log of it. It's okay to do this because that sum is very small with respect to $\log(e^{11})$, so any underflow to 0 wouldn't have been significant enough to make a difference anyway. Overflow can't happen in the expression on the right because we are guaranteed that after multiplying by $e^{-11}$, all the powers will be $\leq 0$.

Formally, we call $m=\max(z^i_1, z^i_2, z^i_3, ...)$. Then:

$$\log\!(\sum\limits_k \exp(z^i_k)) = m + \log(\sum\limits_k \exp(z^i_k - m))$$

Our softmax function then becomes:

$$a^i_j = \exp(\log(a^i_j))=\exp\!\left( z^i_j - m - \log(\sum\limits_k \exp(z^i_k - m))\right)$$

Also as a sidenote, the derivative of the softmax function is:

$$\frac{d \sigma(z^i_j)}{d z^i_j}=\sigma^{\prime}(z^i_j)= \sigma(z^i_j)(1 - \sigma(z^i_j))$$

Maxout

This one is also a little tricky. Essentially the idea is that we break up each neuron in our maxout layer into lots of sub-neurons, each of which have their own weights and biases. Then the input to a neuron goes to each of it's sub-neurons instead, and each sub-neuron simply outputs their $z$'s (without applying any activation function). The $a^i_j$ of that neuron is then the max of all its sub-neuron's outputs.

Formally, in a single neuron, say we have $n$ sub-neurons. Then

$$a^i_j = \max\limits_{k \in [1,n]} s^i_{jk}$$

where

$$s^i_{jk} = a^{i-1} \bullet w^i_{jk} + b^i_{jk}$$

($\bullet$ is the dot product)

To help us think about this, consider the weight matrix $W^i$ for the $i^{\text{th}}$ layer of a neural network that is using, say, a sigmoid activation function. $W^i$ is a 2D matrix, where each column $W^i_j$ is a vector for neuron $j$ containing a weight for every neuron in the the previous layer $i-1$.

If we're going to have sub-neurons, we're going to need a 2D weight matrix for each neuron, since each sub-neuron will need a vector containing a weight for every neuron in the previous layer. This means that $W^i$ is now a 3D weight matrix, where each $W^i_j$ is the 2D weight matrix for a single neuron $j$. And then $W^i_{jk}$ is a vector for sub-neuron $k$ in neuron $j$ that contains a weight for every neuron in the previous layer $i-1$.

Likewise, in a neural network that is again using, say, a sigmoid activation function, $b^i$ is a vector with a bias $b^i_j$ for each neuron $j$ in layer $i$.

To do this with sub-neurons, we need a 2D bias matrix $b^i$ for each layer $i$, where $b^i_j$ is the vector with a bias for $b^i_{jk}$ each subneuron $k$ in the $j^{\text{th}}$ neuron.

Having a weight matrix $w^i_j$ and a bias vector $b^i_j$ for each neuron then makes the above expressions very clear, it's simply applying each sub-neuron's weights $w^i_{jk}$ to the outputs $a^{i-1}$ from layer $i-1$, then applying their biases $b^i_{jk}$ and taking the max of them.

Radial Basis Function Networks

Radial Basis Function Networks are a modification of Feedforward Neural Networks, where instead of using

$$a^i_j=\sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$

we have one weight $w^i_{jk}$ per node $k$ in the previous layer (as normal), and also one mean vector $\mu^i_{jk}$ and one standard deviation vector $\sigma^i_{jk}$ for each node in the previous layer.

Then we call our activation function $\rho$ to avoid getting it confused with the standard deviation vectors $\sigma^i_{jk}$. Now to compute $a^i_j$ we first need to compute one $z^i_{jk}$ for each node in the previous layer. One option is to use Euclidean distance:

$$z^i_{jk}=\sqrt{\Vert(a^{i-1}-\mu^i_{jk}\Vert}=\sqrt{\sum\limits_\ell (a^{i-1}_\ell - \mu^i_{jk\ell})^2}$$

Where $\mu^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\mu^i_{jk}$. This one does not use the $\sigma^i_{jk}$. Alternatively there is Mahalanobis distance, which supposedly performs better:

$$z^i_{jk}=\sqrt{(a^{i-1}-\mu^i_{jk})^T \Sigma^i_{jk} (a^{i-1}-\mu^i_{jk})}$$

where $\Sigma^i_{jk}$ is the covariance matrix, defined as:

$$\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$$

In other words, $\Sigma^i_{jk}$ is the diagonal matrix with $\sigma^i_{jk}$ as it's diagonal elements. We define $a^{i-1}$ and $\mu^i_{jk}$ as column vectors here because that is the notation that is normally used.

These are really just saying that Mahalanobis distance is defined as

$$z^i_{jk}=\sqrt{\sum\limits_\ell \frac{(a^{i-1}_{\ell} - \mu^i_{jk\ell})^2}{\sigma^i_{jk\ell}}}$$

Where $\sigma^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\sigma^i_{jk}$. Note that $\sigma^i_{jk\ell}$ must always be positive, but this is a typical requirement for standard deviation so this isn't that surprising.

If desired, Mahalanobis distance is general enough that the covariance matrix $\Sigma^i_{jk}$ can be defined as other matrices. For example, if the covariance matrix is the identity matrix, our Mahalanobis distance reduces to the Euclidean distance. $\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$ is pretty common though, and is known as normalized Euclidean distance.

Either way, once our distance function has been chosen, we can compute $a^i_j$ via

$$a^i_j=\sum\limits_k w^i_{jk}\rho(z^i_{jk})$$

In these networks they choose to multiply by weights after applying the activation function for reasons.

This describes how to make a multi-layer Radial Basis Function network, however, usually there is only one of these neurons, and its output is the output of the network. It's drawn as multiple neurons because each mean vector $\mu^i_{jk}$ and each standard deviation vector $\sigma^i_{jk}$ of that single neuron is considered a one "neuron" and then after all of these outputs there is another layer that takes the sum of those computed values times the weights, just like $a^i_j$ above. Splitting it into two layers with a "summing" vector at the end seems odd to me, but it's what they do.

Also see here.

Radial Basis Function Network Activation Functions

Gaussian

$$\rho(z^i_{jk}) = \exp\!\big(-\frac{1}{2} (z^i_{jk})^2\big)$$

Gaussian

Multiquadratic

Choose some point $(x, y)$. Then we compute the distance from $(z^i_j, 0)$ to $(x, y)$:

$$\rho(z^i_{jk}) = \sqrt{(z^i_{jk}-x)^2 + y^2}$$

This is from Wikipedia. It isn't bounded, and can be any positive value, though I am wondering if there is a way to normalize it.

When $y=0$, this is equivalent to absolute (with a horizontal shift $x$).

Multiquadratic

Inverse Multiquadratic

Same as quadratic, except flipped:

$$\rho(z^i_{jk}) = \frac{1}{\sqrt{(z^i_{jk}-x)^2 + y^2}}$$

Inverse Multiquadratic

*Graphics from intmath's Graphs using SVG.

Why Tanh Activation Function Outperforms Sigmoid in Neural Networks

Yan LeCun and others argue in Efficient BackProp that

Convergence is usually faster if the average of each input variable over the training set is close to zero. To see this, consider the extreme case where all the inputs are positive. Weights to a particular node in the first weight layer are updated by an amount proportional to $\delta x$ where $\delta$ is the (scalar) error at that node and $x$ is the input vector (see equations (5) and (10)). When all of the components of an input vector are positive, all of the updates of weights that feed into a node will have the same sign (i.e. sign($\delta$)). As a result, these weights can only all decrease or all increase together for a given input pattern. Thus, if a weight vector must change direction it can only do so by zigzagging which is inefficient and thus very slow.

This is why you should normalize your inputs so that the average is zero.

The same logic applies to middle layers:

This heuristic should be applied at all layers which means that we want the average of the outputs of a node to be close to zero because these outputs are the inputs to the next layer.

Postscript @craq makes the point that this quote doesn't make sense for ReLU(x)=max(0,x) which has become a widely popular activation function. While ReLU does avoid the first zigzag problem mentioned by LeCun, it doesn't solve this second point by LeCun who says it is important to push the average to zero. I would love to know what LeCun has to say about this. In any case, there is a paper called Batch Normalization, which builds on top of the work of LeCun and offers a way to address this issue:

It has been long known (LeCun et al., 1998b; Wiesler & Ney, 2011) that the network training converges faster if its inputs are whitened – i.e., linearly transformed to have zero means and unit variances, and decorrelated. As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer.

By the way, this video by Siraj explains a lot about activation functions in 10 fun minutes.

@elkout says "The real reason that tanh is preferred compared to sigmoid (...) is that the derivatives of the tanh are larger than the derivatives of the sigmoid."

I think this is a non-issue. I never seen this being a problem in the literature. If it bothers you that one derivative is smaller than another, you can just scale it.

The logistic function has the shape $\sigma(x)=\frac{1}{1+e^{-kx}}$. Usually, we use $k=1$, but nothing forbids you from using another value for $k$ to make your derivatives wider, if that was your problem.

Nitpick: tanh is also a sigmoid function. Any function with a S shape is a sigmoid. What you guys are calling sigmoid is the logistic function. The reason why the logistic function is more popular is historical reasons. It has been used for a longer time by statisticians. Besides, some feel that it is more biologically plausible.

EDIT

Best Answer

Related Solutions

Solved – Comprehensive list of activation functions in neural networks with pros/cons

Notation

Activation Functions

Identity

Step

Piecewise Linear

Sigmoid

Complementary log-log

Bipolar

Bipolar Sigmoid

Tanh

LeCun's Tanh

Hard Tanh

Absolute

Rectifier

Modifications of ReLU

Smooth Rectifier

Logit

Probit

Cosine

Softmax

Maxout

Radial Basis Function Networks

Radial Basis Function Network Activation Functions

Gaussian

Multiquadratic

Inverse Multiquadratic

Why Tanh Activation Function Outperforms Sigmoid in Neural Networks

Related Question