Solved – Comprehensive list of activation functions in neural networks with pros/cons

Tags: neural-networks, references

Are there any reference document(s) that give a comprehensive list of activation functions in neural networks along with their pros/cons (and ideally some pointers to publications where they were successful or not so successful)?

Best Answer

I'll start making a list here of the ones I've learned so far. As @marcodena said, pros and cons are more difficult because it's mostly just heuristics learned from trying these things, but I figure at least having a list of what they are can't hurt.

First, I'll define notation explicitly so there is no confusion:

Notation

This notation is from Nielsen's book.

A Feedforward Neural Network consists of many layers of neurons connected together. It takes in an input, that input "trickles" through the network, and the network returns an output vector.

More formally, call $a^i_j$ the activation (aka output) of the $j^{th}$ neuron in the $i^{th}$ layer, where $a^1_j$ is the $j^{th}$ element in the input vector.

Then we can relate a layer's activations to those of the previous layer via the following relation:

$$a^i_j = \sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$

where

  • $\sigma$ is the activation function,
  • $w^i_{jk}$ is the weight from the $k^{th}$ neuron in the $(i-1)^{th}$ layer to the $j^{th}$ neuron in the $i^{th}$ layer,
  • $b^i_j$ is the bias of the $j^{th}$ neuron in the $i^{th}$ layer, and
  • $a^i_j$ represents the activation value of the $j^{th}$ neuron in the $i^{th}$ layer.

Sometimes we write $z^i_j$ to represent $\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j$, in other words, the weighted input of a neuron before the activation function is applied.


For more concise notation we can write

$$a^i = \sigma(w^i \times a^{i-1} + b^i)$$

To use this formula to compute the output of a feedforward network for some input $I \in \mathbb{R}^n$, set $a^1 = I$, then compute $a^2, a^3, \ldots, a^m$, where $m$ is the number of layers.
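
As a concrete illustration, here is a minimal NumPy sketch of that forward pass. The layer sizes, the random initialization, and the choice of a sigmoid $\sigma$ are assumptions made only for the example:

```python
import numpy as np

def sigma(z):
    """Activation function; a sigmoid here, but any of the functions below would work."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(I, weights, biases):
    """Set a^1 = I, then compute a^2, ..., a^m.

    weights and biases hold one weight matrix w^i and one bias vector b^i
    per layer after the input layer."""
    a = I
    for w, b in zip(weights, biases):
        a = sigma(w @ a + b)  # a^i = sigma(w^i a^{i-1} + b^i)
    return a

# Tiny example: a 3-4-2 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
print(feedforward(np.array([0.5, -1.0, 2.0]), weights, biases))
```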

Activation Functions

(in the following, we will write $\exp(x)$ instead of $e^x$ for readability)

Identity

Also known as a linear activation function.

$$a^i_j = \sigma(z^i_j) = z^i_j$$


Step

$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j > 0 \end{cases}$$


Piecewise Linear

Choose some $x_{\min}$ and $x_{\max}$, which is our "range". Everything below this range will be 0, and everything above it will be 1. Anything in between is linearly interpolated. Formally:

$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < x_{\min} \\ m z^i_j+b & \text{if } x_{\min} \leq z^i_j \leq x_{\max} \\ 1 & \text{if } z^i_j > x_{\max} \end{cases}$$

Where

$$m = \frac{1}{x_{\max}-x_{\min}}$$

and

$$b = -m x_{\min} = 1 - m x_{\max}$$
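
A short sketch of this activation, with $x_{\min}$ and $x_{\max}$ passed in as parameters (their default values here are arbitrary choices for the example):

```python
import numpy as np

def piecewise_linear(z, x_min=-1.0, x_max=1.0):
    """0 below x_min, 1 above x_max, linear interpolation in between."""
    m = 1.0 / (x_max - x_min)   # slope
    b = -m * x_min              # intercept; equivalently 1 - m * x_max
    return np.clip(m * z + b, 0.0, 1.0)

print(piecewise_linear(np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 3.0])))
```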


Sigmoid

$$a^i_j = \sigma(z^i_j) = \frac{1}{1+\exp(-z^i_j)}$$


Complementary log-log

$$a^i_j = \sigma(z^i_j) = 1 - \exp\!\big(-\exp(z^i_j)\big)$$


Bipolar

$$a^i_j = \sigma(z^i_j) = \begin{cases} -1 & \text{if } z^i_j < 0 \\ \ \ \ 1 & \text{if } z^i_j > 0 \end{cases}$$


Bipolar Sigmoid

$$a^i_j = \sigma(z^i_j) = \frac{1-\exp(-z^i_j)}{1+\exp(-z^i_j)}$$

Tanh

$$a^i_j = \sigma(z^i_j) = \tanh(z^i_j)$$


LeCun's Tanh

See Efficient Backprop. $$a^i_j = \sigma(z^i_j) = 1.7159 \tanh\!\left( \frac{2}{3} z^i_j\right)$$



Hard Tanh

$$a^i_j = \sigma(z^i_j) = \max\!\big(-1, \min(1, z^i_j)\big)$$


Absolute

$$a^i_j = \sigma(z^i_j) = \left| z^i_j \right|$$


Rectifier

Also known as Rectified Linear Unit (ReLU), Max, or the Ramp Function.

$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)$$


Modifications of ReLU

These are some activation functions that I have been playing with that seem to have very good performance for MNIST for mysterious reasons.

$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\cos(z^i_j)$$



$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\sin(z^i_j)$$



Smooth Rectifier

Also known as Smooth Rectified Linear Unit, Smooth Max, or Softplus.

$$a^i_j = \sigma(z^i_j) = \log\!\big(1+\exp(z^i_j)\big)$$
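
For reference, here is a minimal NumPy sketch of a few of the element-wise activations listed above (sigmoid, tanh, hard tanh, rectifier/ReLU, and softplus); the function names and the `np.log1p` form of softplus are my own choices for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)   # max(-1, min(1, z))

def relu(z):
    return np.maximum(0.0, z)      # max(0, z)

def softplus(z):
    return np.log1p(np.exp(z))     # log(1 + exp(z)); can overflow for very large z

z = np.linspace(-3.0, 3.0, 7)
for f in (sigmoid, tanh, hard_tanh, relu, softplus):
    print(f.__name__, f(z))
```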


Logit

$$a^i_j = \sigma(z^i_j) = \log\!\bigg(\frac{z^i_j}{1 - z^i_j}\bigg)$$



Probit

$$a^i_j = \sigma(z^i_j) = \sqrt{2}\,\text{erf}^{-1}(2z^i_j-1)$$.

Where $\text{erf}$ is the Error Function. It can't be described via elementary functions, but you can find ways of approximating its inverse at that Wikipedia page and here.

Alternatively, it can be expressed as

$$a^i_j = \sigma(z^i_j) = \Phi^{-1}(z^i_j)$$

where $\Phi$ is the Cumulative Distribution Function (CDF) of the standard normal distribution, so the probit is its inverse (the quantile function). See here for means of approximating this.
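
In practice you would lean on a library for the inverse error function or the normal quantile function. A minimal sketch using SciPy (the use of SciPy here is my own assumption, not something the answer requires), showing that the two formulations agree:

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import norm

def probit_via_erf(z):
    """sqrt(2) * erfinv(2z - 1), valid for 0 < z < 1."""
    return np.sqrt(2.0) * erfinv(2.0 * z - 1.0)

def probit_via_cdf(z):
    """Inverse CDF (quantile function) of the standard normal."""
    return norm.ppf(z)

z = np.array([0.1, 0.5, 0.9])
print(probit_via_erf(z))   # approx. [-1.2816, 0.0, 1.2816]
print(probit_via_cdf(z))   # same values
```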



Cosine

See Random Kitchen Sinks.

$$a^i_j = \sigma(z^i_j) = \cos(z^i_j)$$.


Softmax

Also known as the Normalized Exponential. $$a^i_j = \frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}$$

This one is a little weird because the output of a single neuron depends on the other neurons in its layer. It can also be difficult to compute numerically: $z^i_j$ may be a very high value, in which case $\exp(z^i_j)$ will probably overflow. Likewise, if $z^i_j$ is a very low value, $\exp(z^i_j)$ will underflow and become $0$.

To combat this, we will instead compute $\log(a^i_j)$. This gives us:

$$\log(a^i_j) = \log\left(\frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}\right)$$

$$\log(a^i_j) = z^i_j - \log(\sum\limits_k \exp(z^i_k))$$

Here we need to use the log-sum-exp trick:

Let's say we are computing:

$$\log(e^2 + e^9 + e^{11} + e^{-7} + e^{-2} + e^5)$$

We will first sort our exponentials by magnitude for convenience:

$$\log(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7})$$

Then, since $e^{11}$ is our highest, we multiply by $\frac{e^{-11}}{e^{-11}}$:

$$\log(\frac{e^{-11}}{e^{-11}}(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7}))$$

$$\log(\frac{1}{e^{-11}}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$

$$\log(e^{11}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$

$$\log(e^{11}) + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$

$$ 11 + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$

We can then compute the expression on the right and take the log of it. It's okay to do this because any term small enough to underflow to $0$ is negligible compared to the leading $e^{0}$ term, so losing it doesn't meaningfully change the sum. Overflow can't happen in the expression on the right either, because after multiplying by $e^{-11}$ all of the exponents are guaranteed to be $\leq 0$.

Formally, we call $m=\max(z^i_1, z^i_2, z^i_3, ...)$. Then:

$$\log\!(\sum\limits_k \exp(z^i_k)) = m + \log(\sum\limits_k \exp(z^i_k - m))$$

Our softmax function then becomes:

$$a^i_j = \exp(\log(a^i_j))=\exp\!\left( z^i_j - m - \log(\sum\limits_k \exp(z^i_k - m))\right)$$
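
Putting the pieces together, a minimal NumPy sketch of this numerically stable softmax:

```python
import numpy as np

def log_softmax(z):
    """log(a_j) = z_j - m - log(sum_k exp(z_k - m)), with m = max(z)."""
    m = np.max(z)
    return z - m - np.log(np.sum(np.exp(z - m)))

def softmax(z):
    return np.exp(log_softmax(z))

z = np.array([2.0, 9.0, 11.0, -7.0, -2.0, 5.0])  # the example exponents above
print(softmax(z))                                 # sums to 1, no overflow
print(softmax(z + 1000.0))                        # same result; a naive exp(z) would overflow
```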

Also as a sidenote, the derivative of a softmax output with respect to its own weighted input is:

$$\frac{\partial a^i_j}{\partial z^i_j}= a^i_j\,(1 - a^i_j)$$

while for $k \neq j$ the cross term is $\frac{\partial a^i_j}{\partial z^i_k} = -a^i_j\,a^i_k$, since each output depends on every $z^i_k$ in the layer.

Maxout

This one is also a little tricky. Essentially the idea is that we break up each neuron in our maxout layer into lots of sub-neurons, each of which has its own weights and biases. The input to a neuron then goes to each of its sub-neurons instead, and each sub-neuron simply outputs its $z$ (without applying any activation function). The $a^i_j$ of that neuron is then the max of all its sub-neurons' outputs.

Formally, in a single neuron, say we have $n$ sub-neurons. Then

$$a^i_j = \max\limits_{k \in [1,n]} s^i_{jk}$$

where

$$s^i_{jk} = a^{i-1} \bullet w^i_{jk} + b^i_{jk}$$

($\bullet$ is the dot product)

To help us think about this, consider the weight matrix $W^i$ for the $i^{\text{th}}$ layer of a neural network that is using, say, a sigmoid activation function. $W^i$ is a 2D matrix, where each column $W^i_j$ is a vector for neuron $j$ containing a weight for every neuron in the previous layer $i-1$.

If we're going to have sub-neurons, we're going to need a 2D weight matrix for each neuron, since each sub-neuron will need a vector containing a weight for every neuron in the previous layer. This means that $W^i$ is now a 3D weight matrix, where each $W^i_j$ is the 2D weight matrix for a single neuron $j$. And then $W^i_{jk}$ is a vector for sub-neuron $k$ in neuron $j$ that contains a weight for every neuron in the previous layer $i-1$.

Likewise, in a neural network that is again using, say, a sigmoid activation function, $b^i$ is a vector with a bias $b^i_j$ for each neuron $j$ in layer $i$.

To do this with sub-neurons, we need a 2D bias matrix $b^i$ for each layer $i$, where $b^i_j$ is the vector with a bias $b^i_{jk}$ for each sub-neuron $k$ in the $j^{\text{th}}$ neuron.

Having a weight matrix $w^i_j$ and a bias vector $b^i_j$ for each neuron makes the above expressions clear: we simply apply each sub-neuron's weights $w^i_{jk}$ to the outputs $a^{i-1}$ from layer $i-1$, add their biases $b^i_{jk}$, and take the max of the results.
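
A minimal NumPy sketch of a maxout layer following this description (the layer sizes and the number of sub-neurons are arbitrary choices for the example):

```python
import numpy as np

def maxout_layer(a_prev, W, b):
    """a_prev: activations from layer i-1, shape (d,).
    W: 3D weight array, shape (num_neurons, num_subneurons, d),
       so W[j, k] is the weight vector w^i_{jk}.
    b: 2D bias array, shape (num_neurons, num_subneurons).
    Returns a^i, shape (num_neurons,)."""
    s = W @ a_prev + b        # s[j, k] = w^i_{jk} . a^{i-1} + b^i_{jk}
    return s.max(axis=1)      # a^i_j = max_k s^i_{jk}

rng = np.random.default_rng(0)
a_prev = rng.standard_normal(5)        # 5 neurons in layer i-1
W = rng.standard_normal((3, 4, 5))     # 3 maxout neurons, 4 sub-neurons each
b = rng.standard_normal((3, 4))
print(maxout_layer(a_prev, W, b))
```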

Radial Basis Function Networks

Radial Basis Function Networks are a modification of Feedforward Neural Networks, where instead of using

$$a^i_j=\sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$

we have one weight $w^i_{jk}$ per node $k$ in the previous layer (as normal), and also one mean vector $\mu^i_{jk}$ and one standard deviation vector $\sigma^i_{jk}$ for each node in the previous layer.

Then we call our activation function $\rho$ to avoid getting it confused with the standard deviation vectors $\sigma^i_{jk}$. Now to compute $a^i_j$ we first need to compute one $z^i_{jk}$ for each node in the previous layer. One option is to use Euclidean distance:

$$z^i_{jk}=\big\Vert a^{i-1}-\mu^i_{jk}\big\Vert=\sqrt{\sum\limits_\ell (a^{i-1}_\ell - \mu^i_{jk\ell})^2}$$

Where $\mu^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\mu^i_{jk}$. This one does not use the $\sigma^i_{jk}$. Alternatively there is Mahalanobis distance, which supposedly performs better:

$$z^i_{jk}=\sqrt{(a^{i-1}-\mu^i_{jk})^T \big(\Sigma^i_{jk}\big)^{-1} (a^{i-1}-\mu^i_{jk})}$$

where $\Sigma^i_{jk}$ is the covariance matrix, defined as:

$$\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$$

In other words, $\Sigma^i_{jk}$ is the diagonal matrix with $\sigma^i_{jk}$ as its diagonal elements. We define $a^{i-1}$ and $\mu^i_{jk}$ as column vectors here because that is the notation that is normally used.

These are really just saying that Mahalanobis distance is defined as

$$z^i_{jk}=\sqrt{\sum\limits_\ell \frac{(a^{i-1}_{\ell} - \mu^i_{jk\ell})^2}{\sigma^i_{jk\ell}}}$$

Where $\sigma^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\sigma^i_{jk}$. Note that $\sigma^i_{jk\ell}$ must always be positive, but this is a typical requirement for standard deviation so this isn't that surprising.

If desired, Mahalanobis distance is general enough that the covariance matrix $\Sigma^i_{jk}$ can be defined as other matrices. For example, if the covariance matrix is the identity matrix, our Mahalanobis distance reduces to the Euclidean distance. $\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$ is pretty common though, and is known as normalized Euclidean distance.

Either way, once our distance function has been chosen, we can compute $a^i_j$ via

$$a^i_j=\sum\limits_k w^i_{jk}\rho(z^i_{jk})$$

In these networks, the weights are multiplied in after the activation function is applied (rather than before); this is simply how Radial Basis Function networks are defined.

This describes how to make a multi-layer Radial Basis Function network; however, usually there is only one of these neurons, and its output is the output of the network. It's drawn as multiple neurons because each mean vector $\mu^i_{jk}$ and each standard deviation vector $\sigma^i_{jk}$ of that single neuron is treated as one "neuron", and then, after all of these outputs, there is another layer that takes the sum of those computed values times the weights, just like $a^i_j$ above. Splitting it into two layers with a "summing" layer at the end seems odd to me, but it's what they do.
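
A minimal NumPy sketch of one such RBF layer, using the normalized Euclidean (diagonal Mahalanobis) distance from above and the Gaussian $\rho$ from the next section; the sizes are arbitrary choices for the example:

```python
import numpy as np

def gaussian_rho(z):
    """Gaussian RBF: exp(-z^2 / 2)."""
    return np.exp(-0.5 * z ** 2)

def rbf_layer(a_prev, w, mu, sigma):
    """a_prev: activations from layer i-1, shape (d,).
    w:     weights, shape (num_neurons, num_units)         -> w^i_{jk}
    mu:    mean vectors, shape (num_neurons, num_units, d)  -> mu^i_{jk}
    sigma: std-dev vectors, same shape as mu (all > 0)      -> sigma^i_{jk}
    Returns a^i, shape (num_neurons,)."""
    diff = a_prev - mu                                # broadcasts over (j, k, ell)
    z = np.sqrt(np.sum(diff ** 2 / sigma, axis=-1))   # normalized Euclidean distance
    return np.sum(w * gaussian_rho(z), axis=-1)       # a^i_j = sum_k w^i_{jk} rho(z^i_{jk})

rng = np.random.default_rng(0)
a_prev = rng.standard_normal(4)
w = rng.standard_normal((2, 3))                # 2 output neurons, 3 RBF units each
mu = rng.standard_normal((2, 3, 4))
sigma = rng.uniform(0.5, 2.0, size=(2, 3, 4))  # must be positive
print(rbf_layer(a_prev, w, mu, sigma))
```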

Also see here.

Radial Basis Function Network Activation Functions

Gaussian

$$\rho(z^i_{jk}) = \exp\!\big(-\frac{1}{2} (z^i_{jk})^2\big)$$


Multiquadratic

Choose some point $(x, y)$. Then we compute the distance from $(z^i_{jk}, 0)$ to $(x, y)$:

$$\rho(z^i_{jk}) = \sqrt{(z^i_{jk}-x)^2 + y^2}$$

This is from Wikipedia. It isn't bounded, and can be any positive value, though I am wondering if there is a way to normalize it.

When $y=0$, this is equivalent to the absolute value (with a horizontal shift of $x$).


Inverse Multiquadratic

Same as multiquadratic, except inverted:

$$\rho(z^i_{jk}) = \frac{1}{\sqrt{(z^i_{jk}-x)^2 + y^2}}$$


*Graphics from intmath's Graphs using SVG.
